linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4
@ 2014-01-16  1:24 Matthew Wilcox
  2014-01-16  1:24 ` [PATCH v5 01/22] Fix XIP fault vs truncate race Matthew Wilcox
                   ` (28 more replies)
  0 siblings, 29 replies; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-16  1:24 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-ext4; +Cc: Matthew Wilcox

This series of patches add support for XIP to ext4.  Unfortunately,
it turns out to be necessary to rewrite the existing XIP support code
first due to races that are unfixable in the current design.

Since v4 of this patchset, I've improved the documentation, fixed a
couple of warnings that a newer version of gcc emitted, and fixed a
bug where we would read/write the wrong address for I/Os that were not
aligned to PAGE_SIZE.

I've dropped the PMD fault patch from this set since there are some
places where we would need to split a PMD page and there's no way to do
that right now.  In its place, I've added a patch which attempts to add
support for unwritten extents.  I'm still in two minds about this; on the
one hand, it's clearly a win for reads and writes.  On the other hand,
it adds a lot of complexity, and it probably isn't a win for pagefaults.

Patch 2 wants careful review from the MM people.

Matthew Wilcox (21):
  Fix XIP fault vs truncate race
  Allow page fault handlers to perform the COW
  axonram: Fix bug in direct_access
  Change direct_access calling convention
  Introduce IS_XIP(inode)
  Treat XIP like O_DIRECT
  Rewrite XIP page fault handling
  Change xip_truncate_page to take a get_block parameter
  Remove mm/filemap_xip.c
  Remove get_xip_mem
  Replace ext2_clear_xip_target with xip_clear_blocks
  ext2: Remove ext2_xip_verify_sb()
  ext2: Remove ext2_use_xip
  ext2: Remove xip.c and xip.h
  Remove CONFIG_EXT2_FS_XIP
  ext2: Remove ext2_aops_xip
  xip: Add xip_zero_page_range
  ext4: Make ext4_block_zero_page_range static
  ext4: Fix typos
  xip: Add reporting of major faults
  XIP: Add support for unwritten extents

Ross Zwisler (1):
  ext4: Add XIP functionality

 Documentation/filesystems/Locking  |   3 -
 Documentation/filesystems/ext4.txt |   2 +
 Documentation/filesystems/xip.txt  | 126 +++++-----
 arch/powerpc/sysdev/axonram.c      |   8 +-
 drivers/block/brd.c                |   8 +-
 drivers/s390/block/dcssblk.c       |  19 +-
 fs/Kconfig                         |  21 +-
 fs/Makefile                        |   1 +
 fs/exofs/inode.c                   |   1 -
 fs/ext2/Kconfig                    |  11 -
 fs/ext2/Makefile                   |   1 -
 fs/ext2/ext2.h                     |   1 -
 fs/ext2/file.c                     |  43 +++-
 fs/ext2/inode.c                    |  35 +--
 fs/ext2/namei.c                    |   9 +-
 fs/ext2/super.c                    |  38 ++-
 fs/ext2/xip.c                      |  91 -------
 fs/ext2/xip.h                      |  26 --
 fs/ext4/ext4.h                     |   4 +-
 fs/ext4/file.c                     |  53 +++-
 fs/ext4/indirect.c                 |  19 +-
 fs/ext4/inode.c                    | 106 +++++---
 fs/ext4/namei.c                    |  10 +-
 fs/ext4/super.c                    |  39 ++-
 fs/open.c                          |   5 +-
 fs/xip.c                           | 428 ++++++++++++++++++++++++++++++++
 include/linux/blkdev.h             |   4 +-
 include/linux/fs.h                 |  47 +++-
 include/linux/mm.h                 |   2 +
 mm/Makefile                        |   1 -
 mm/fadvise.c                       |   6 +-
 mm/filemap.c                       |   6 +-
 mm/filemap_xip.c                   | 483 -------------------------------------
 mm/madvise.c                       |   2 +-
 mm/memory.c                        |  19 +-
 35 files changed, 851 insertions(+), 827 deletions(-)
 delete mode 100644 fs/ext2/xip.c
 delete mode 100644 fs/ext2/xip.h
 create mode 100644 fs/xip.c
 delete mode 100644 mm/filemap_xip.c

-- 
1.8.5.2


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v5 01/22] Fix XIP fault vs truncate race
  2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
@ 2014-01-16  1:24 ` Matthew Wilcox
  2014-01-16  1:24 ` [PATCH v5 02/22] Allow page fault handlers to perform the COW Matthew Wilcox
                   ` (27 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-16  1:24 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-ext4; +Cc: Matthew Wilcox

Pagecache faults recheck i_size after taking the page lock to ensure that
the fault didn't race against a truncate.  We don't have a page to lock
in the XIP case, so use the i_mmap_mutex instead.  It is locked in the
truncate path in unmap_mapping_range() after updating i_size.  So while
we hold it in the fault path, we are guaranteed that either i_size has
already been updated in the truncate path, or that the truncate will
subsequently call zap_page_range_single() and so remove the mapping we
have just inserted.

There is a window of time in which i_size has been reduced and the
thread has a mapping to a page which will be removed from the file,
but this is harmless as the page will not be allocated to a different
purpose before the thread's access to it is revoked.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 mm/filemap_xip.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d8d9fe3..c8d23e9 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -260,8 +260,17 @@ again:
 		__xip_unmap(mapping, vmf->pgoff);
 
 found:
+		/* We must recheck i_size under i_mmap_mutex */
+		mutex_lock(&mapping->i_mmap_mutex);
+		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+							PAGE_CACHE_SHIFT;
+		if (unlikely(vmf->pgoff >= size)) {
+			mutex_unlock(&mapping->i_mmap_mutex);
+			return VM_FAULT_SIGBUS;
+		}
 		err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
 							xip_pfn);
+		mutex_unlock(&mapping->i_mmap_mutex);
 		if (err == -ENOMEM)
 			return VM_FAULT_OOM;
 		/*
@@ -285,16 +294,27 @@ found:
 		}
 		if (error != -ENODATA)
 			goto out;
+
+		/* We must recheck i_size under i_mmap_mutex */
+		mutex_lock(&mapping->i_mmap_mutex);
+		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+							PAGE_CACHE_SHIFT;
+		if (unlikely(vmf->pgoff >= size)) {
+			ret = VM_FAULT_SIGBUS;
+			goto unlock;
+		}
 		/* not shared and writable, use xip_sparse_page() */
 		page = xip_sparse_page();
 		if (!page)
-			goto out;
+			goto unlock;
 		err = vm_insert_page(vma, (unsigned long)vmf->virtual_address,
 							page);
 		if (err == -ENOMEM)
-			goto out;
+			goto unlock;
 
 		ret = VM_FAULT_NOPAGE;
+unlock:
+		mutex_unlock(&mapping->i_mmap_mutex);
 out:
 		write_seqcount_end(&xip_sparse_seq);
 		mutex_unlock(&xip_sparse_mutex);
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 02/22] Allow page fault handlers to perform the COW
  2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
  2014-01-16  1:24 ` [PATCH v5 01/22] Fix XIP fault vs truncate race Matthew Wilcox
@ 2014-01-16  1:24 ` Matthew Wilcox
  2014-01-16  1:24 ` [PATCH v5 03/22] axonram: Fix bug in direct_access Matthew Wilcox
                   ` (26 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-16  1:24 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-ext4; +Cc: Matthew Wilcox

Currently COW of an XIP file is done by first bringing in a read-only
mapping, then retrying the fault and copying the page.  It is much more
efficient to tell the fault handler that a COW is being attempted (by
passing in the pre-allocated page in the vm_fault structure), and allow
the handler to perform the COW operation itself.

Where the filemap code protects against truncation of the file until
the PTE has been installed with the page lock, the XIP code use the
i_mmap_mutex instead.  We must therefore unlock the i_mmap_mutex in
__do_fault().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 19 ++++++++++++++++---
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1cedd00..e07c57c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -195,6 +195,7 @@ struct vm_fault {
 	pgoff_t pgoff;			/* Logical page offset based on vma */
 	void __user *virtual_address;	/* Faulting virtual address */
 
+	struct page *cow_page;		/* Handler may choose to COW */
 	struct page *page;		/* ->fault handlers should return a
 					 * page here, unless VM_FAULT_NOPAGE
 					 * is set (which is also implied by
@@ -958,6 +959,7 @@ static inline int page_mapped(struct page *page)
 #define VM_FAULT_HWPOISON 0x0010	/* Hit poisoned small page */
 #define VM_FAULT_HWPOISON_LARGE 0x0020  /* Hit poisoned large page. Index encoded in upper bits */
 
+#define VM_FAULT_COWED	0x0080	/* ->fault COWed the page instead */
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
diff --git a/mm/memory.c b/mm/memory.c
index 5d9025f..3f1b666 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2673,6 +2673,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			vmf.pgoff = old_page->index;
 			vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
 			vmf.page = old_page;
+			vmf.cow_page = NULL;
 
 			/*
 			 * Notify the address space that the page is about to
@@ -3335,11 +3336,18 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	vmf.pgoff = pgoff;
 	vmf.flags = flags;
 	vmf.page = NULL;
+	vmf.cow_page = cow_page;
 
 	ret = vma->vm_ops->fault(vma, &vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
 			    VM_FAULT_RETRY)))
 		goto uncharge_out;
+	if (unlikely(ret & VM_FAULT_COWED)) {
+		page = cow_page;
+		anon = 1;
+		__SetPageUptodate(page);
+		goto cowed;
+	}
 
 	if (unlikely(PageHWPoison(vmf.page))) {
 		if (ret & VM_FAULT_LOCKED)
@@ -3399,6 +3407,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	}
 
+ cowed:
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 
 	/*
@@ -3465,9 +3474,13 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (vma->vm_file && !page_mkwrite)
 			file_update_time(vma->vm_file);
 	} else {
-		unlock_page(vmf.page);
-		if (anon)
-			page_cache_release(vmf.page);
+		if ((ret & VM_FAULT_COWED)) {
+			mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+		} else {
+			unlock_page(vmf.page);
+			if (anon)
+				page_cache_release(vmf.page);
+		}
 	}
 
 	return ret;
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 03/22] axonram: Fix bug in direct_access
  2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
  2014-01-16  1:24 ` [PATCH v5 01/22] Fix XIP fault vs truncate race Matthew Wilcox
  2014-01-16  1:24 ` [PATCH v5 02/22] Allow page fault handlers to perform the COW Matthew Wilcox
@ 2014-01-16  1:24 ` Matthew Wilcox
  2014-01-16  1:24 ` [PATCH v5 04/22] Change direct_access calling convention Matthew Wilcox
                   ` (25 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-16  1:24 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-ext4; +Cc: Matthew Wilcox

The 'pfn' returned by axonram was completely bogus, and has been since
2008.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 arch/powerpc/sysdev/axonram.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 1c16141..1fea249 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -155,7 +155,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
 	}
 
 	*kaddr = (void *)(bank->ph_addr + offset);
-	*pfn = virt_to_phys(kaddr) >> PAGE_SHIFT;
+	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
 
 	return 0;
 }
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 04/22] Change direct_access calling convention
  2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
                   ` (2 preceding siblings ...)
  2014-01-16  1:24 ` [PATCH v5 03/22] axonram: Fix bug in direct_access Matthew Wilcox
@ 2014-01-16  1:24 ` Matthew Wilcox
  2014-01-16  1:24 ` [PATCH v5 05/22] Introduce IS_XIP(inode) Matthew Wilcox
                   ` (24 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-16  1:24 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-ext4; +Cc: Matthew Wilcox

In order to support accesses to larger chunks of memory, pass in a
'size' parameter (counted in bytes), and return the amount available at
that address.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 Documentation/filesystems/xip.txt | 15 +++++++++------
 arch/powerpc/sysdev/axonram.c     |  6 +++---
 drivers/block/brd.c               |  8 +++++---
 drivers/s390/block/dcssblk.c      | 19 ++++++++++---------
 fs/ext2/xip.c                     | 30 +++++++++++++-----------------
 include/linux/blkdev.h            |  4 ++--
 6 files changed, 42 insertions(+), 40 deletions(-)

diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
index 0466ee5..b62eabf 100644
--- a/Documentation/filesystems/xip.txt
+++ b/Documentation/filesystems/xip.txt
@@ -28,12 +28,15 @@ Implementation
 Execute-in-place is implemented in three steps: block device operation,
 address space operation, and file operations.
 
-A block device operation named direct_access is used to retrieve a
-reference (pointer) to a block on-disk. The reference is supposed to be
-cpu-addressable, physical address and remain valid until the release operation
-is performed. A struct block_device reference is used to address the device,
-and a sector_t argument is used to identify the individual block. As an
-alternative, memory technology devices can be used for this.
+A block device operation named direct_access is used to translate the
+block device sector number to a page frame number (pfn) that identifies
+the physical page for the memory.  It also returns a kernel virtual
+address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested.  The function should return the number
+of bytes that it can provide, although it must not exceed the number of
+bytes requested.  It may also return a negative errno if an error occurs.
 
 The block device operation is optional, these block devices support it as of
 today:
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 1fea249..28e69b0 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -138,9 +138,9 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
  * axon_ram_direct_access - direct_access() method for block device
  * @device, @sector, @data: see block_device_operations method
  */
-static int
+static long
 axon_ram_direct_access(struct block_device *device, sector_t sector,
-		       void **kaddr, unsigned long *pfn)
+		       void **kaddr, unsigned long *pfn, long size)
 {
 	struct axon_ram_bank *bank = device->bd_disk->private_data;
 	loff_t offset;
@@ -157,7 +157,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
 	*kaddr = (void *)(bank->ph_addr + offset);
 	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
 
-	return 0;
+	return min_t(unsigned long, size, bank->size - offset);
 }
 
 static const struct block_device_operations axon_ram_devops = {
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index d91f1a5..fa809cb 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -361,8 +361,8 @@ out:
 }
 
 #ifdef CONFIG_BLK_DEV_XIP
-static int brd_direct_access(struct block_device *bdev, sector_t sector,
-			void **kaddr, unsigned long *pfn)
+static long brd_direct_access(struct block_device *bdev, sector_t sector,
+			void **kaddr, unsigned long *pfn, long size)
 {
 	struct brd_device *brd = bdev->bd_disk->private_data;
 	struct page *page;
@@ -379,7 +379,9 @@ static int brd_direct_access(struct block_device *bdev, sector_t sector,
 	*kaddr = page_address(page);
 	*pfn = page_to_pfn(page);
 
-	return 0;
+	/* Could optimistically check to see if the next page in the
+	 * file is mapped to the next page of physical RAM */
+	return PAGE_SIZE;
 }
 #endif
 
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 6eca019..c09a2b0 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -28,8 +28,8 @@
 static int dcssblk_open(struct block_device *bdev, fmode_t mode);
 static void dcssblk_release(struct gendisk *disk, fmode_t mode);
 static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
-static int dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
-				 void **kaddr, unsigned long *pfn);
+static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
+				 void **kaddr, unsigned long *pfn, long size);
 
 static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
 
@@ -865,25 +865,26 @@ fail:
 	bio_io_error(bio);
 }
 
-static int
+static long
 dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
-			void **kaddr, unsigned long *pfn)
+			void **kaddr, unsigned long *pfn, long size)
 {
 	struct dcssblk_dev_info *dev_info;
-	unsigned long pgoff;
+	unsigned long offset, dev_sz;
 
 	dev_info = bdev->bd_disk->private_data;
 	if (!dev_info)
 		return -ENODEV;
+	dev_sz = dev_info->end - dev_info->start;
 	if (secnum % (PAGE_SIZE/512))
 		return -EINVAL;
-	pgoff = secnum / (PAGE_SIZE / 512);
-	if ((pgoff+1)*PAGE_SIZE-1 > dev_info->end - dev_info->start)
+	offset = secnum * 512;
+	if (offset > dev_sz)
 		return -ERANGE;
-	*kaddr = (void *) (dev_info->start+pgoff*PAGE_SIZE);
+	*kaddr = (void *) (dev_info->start + offset);
 	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
 
-	return 0;
+	return min_t(unsigned long, size, dev_sz - offset);
 }
 
 static void
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index e98171a..fa40091 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,18 +13,13 @@
 #include "ext2.h"
 #include "xip.h"
 
-static inline int
-__inode_direct_access(struct inode *inode, sector_t block,
-		      void **kaddr, unsigned long *pfn)
+static inline long __inode_direct_access(struct inode *inode, sector_t block,
+				void **kaddr, unsigned long *pfn, long size)
 {
 	struct block_device *bdev = inode->i_sb->s_bdev;
 	const struct block_device_operations *ops = bdev->bd_disk->fops;
-	sector_t sector;
-
-	sector = block * (PAGE_SIZE / 512); /* ext2 block to bdev sector */
-
-	BUG_ON(!ops->direct_access);
-	return ops->direct_access(bdev, sector, kaddr, pfn);
+	sector_t sector = block * (PAGE_SIZE / 512);
+	return ops->direct_access(bdev, sector, kaddr, pfn, size);
 }
 
 static inline int
@@ -53,12 +48,13 @@ ext2_clear_xip_target(struct inode *inode, sector_t block)
 {
 	void *kaddr;
 	unsigned long pfn;
-	int rc;
+	long size;
 
-	rc = __inode_direct_access(inode, block, &kaddr, &pfn);
-	if (!rc)
-		clear_page(kaddr);
-	return rc;
+	size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
+	if (size < 0)
+		return size;
+	clear_page(kaddr);
+	return 0;
 }
 
 void ext2_xip_verify_sb(struct super_block *sb)
@@ -77,7 +73,7 @@ void ext2_xip_verify_sb(struct super_block *sb)
 int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
 				void **kmem, unsigned long *pfn)
 {
-	int rc;
+	long rc;
 	sector_t block;
 
 	/* first, retrieve the sector number */
@@ -86,6 +82,6 @@ int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
 		return rc;
 
 	/* retrieve address of the target data */
-	rc = __inode_direct_access(mapping->host, block, kmem, pfn);
-	return rc;
+	rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
+	return (rc < 0) ? rc : 0;
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1b135d4..00c8e1a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1566,8 +1566,8 @@ struct block_device_operations {
 	void (*release) (struct gendisk *, fmode_t);
 	int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
-	int (*direct_access) (struct block_device *, sector_t,
-						void **, unsigned long *);
+	long (*direct_access) (struct block_device *, sector_t,
+					void **, unsigned long *pfn, long size);
 	unsigned int (*check_events) (struct gendisk *disk,
 				      unsigned int clearing);
 	/* ->media_changed() is DEPRECATED, use ->check_events() instead */
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 05/22] Introduce IS_XIP(inode)
  2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
                   ` (3 preceding siblings ...)
  2014-01-16  1:24 ` [PATCH v5 04/22] Change direct_access calling convention Matthew Wilcox
@ 2014-01-16  1:24 ` Matthew Wilcox
  2014-01-16  1:24 ` [PATCH v5 06/22] Treat XIP like O_DIRECT Matthew Wilcox
                   ` (23 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-16  1:24 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-ext4; +Cc: Matthew Wilcox

Use an inode flag to tag inodes which should avoid using the page cache.
Convert ext2 to use it instead of mapping_is_xip().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext2/inode.c    | 9 ++++++---
 fs/ext2/xip.h      | 2 --
 include/linux/fs.h | 6 ++++++
 3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 8a33764..9b494ab 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -731,7 +731,7 @@ static int ext2_get_blocks(struct inode *inode,
 		goto cleanup;
 	}
 
-	if (ext2_use_xip(inode->i_sb)) {
+	if (IS_XIP(inode)) {
 		/*
 		 * we need to clear the block
 		 */
@@ -1201,7 +1201,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
 
 	inode_dio_wait(inode);
 
-	if (mapping_is_xip(inode->i_mapping))
+	if (IS_XIP(inode))
 		error = xip_truncate_page(inode->i_mapping, newsize);
 	else if (test_opt(inode->i_sb, NOBH))
 		error = nobh_truncate_page(inode->i_mapping,
@@ -1273,7 +1273,8 @@ void ext2_set_inode_flags(struct inode *inode)
 {
 	unsigned int flags = EXT2_I(inode)->i_flags;
 
-	inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+	inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME |
+				S_DIRSYNC | S_XIP);
 	if (flags & EXT2_SYNC_FL)
 		inode->i_flags |= S_SYNC;
 	if (flags & EXT2_APPEND_FL)
@@ -1284,6 +1285,8 @@ void ext2_set_inode_flags(struct inode *inode)
 		inode->i_flags |= S_NOATIME;
 	if (flags & EXT2_DIRSYNC_FL)
 		inode->i_flags |= S_DIRSYNC;
+	if (test_opt(inode->i_sb, XIP))
+		inode->i_flags |= S_XIP;
 }
 
 /* Propagate flags from i_flags to EXT2_I(inode)->i_flags */
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 18b34d2..29be737 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -16,9 +16,7 @@ static inline int ext2_use_xip (struct super_block *sb)
 }
 int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
 				void **, unsigned long *);
-#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_mem)
 #else
-#define mapping_is_xip(map)			0
 #define ext2_xip_verify_sb(sb)			do { } while (0)
 #define ext2_use_xip(sb)			0
 #define ext2_clear_xip_target(inode, chain)	0
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 121f11f..80cfb42 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1639,6 +1639,7 @@ struct super_operations {
 #define S_IMA		1024	/* Inode has an associated IMA struct */
 #define S_AUTOMOUNT	2048	/* Automount/referral quasi-directory */
 #define S_NOSEC		4096	/* no suid or xattr security attributes */
+#define S_XIP		8192	/* Avoid using the page cache */
 
 /*
  * Note that nosuid etc flags are inode-specific: setting some file-system
@@ -1676,6 +1677,11 @@ struct super_operations {
 #define IS_IMA(inode)		((inode)->i_flags & S_IMA)
 #define IS_AUTOMOUNT(inode)	((inode)->i_flags & S_AUTOMOUNT)
 #define IS_NOSEC(inode)		((inode)->i_flags & S_NOSEC)
+#ifdef CONFIG_FS_XIP
+#define IS_XIP(inode)		((inode)->i_flags & S_XIP)
+#else
+#define IS_XIP(inode)		0
+#endif
 
 /*
  * Inode state bits.  Protected by inode->i_lock
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 06/22] Treat XIP like O_DIRECT
  2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
                   ` (4 preceding siblings ...)
  2014-01-16  1:24 ` [PATCH v5 05/22] Introduce IS_XIP(inode) Matthew Wilcox
@ 2014-01-16  1:24 ` Matthew Wilcox
  2014-01-31 16:59   ` Jan Kara
  2014-01-16  1:24 ` [PATCH v5 07/22] Rewrite XIP page fault handling Matthew Wilcox
                   ` (22 subsequent siblings)
  28 siblings, 1 reply; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-16  1:24 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-ext4; +Cc: Matthew Wilcox

Instead of separate read and write methods, use the generic AIO
infrastructure.  In addition to giving us support for AIO, this adds
the locking between read() and truncate() that was missing.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/Makefile        |   1 +
 fs/ext2/file.c     |   6 +-
 fs/ext2/inode.c    |   7 +-
 fs/xip.c           | 156 +++++++++++++++++++++++++++++++++++
 include/linux/fs.h |  18 ++++-
 mm/filemap.c       |   6 +-
 mm/filemap_xip.c   | 234 -----------------------------------------------------
 7 files changed, 183 insertions(+), 245 deletions(-)
 create mode 100644 fs/xip.c

diff --git a/fs/Makefile b/fs/Makefile
index 4fe6df3..96061f2 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -29,6 +29,7 @@ obj-$(CONFIG_SIGNALFD)		+= signalfd.o
 obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
 obj-$(CONFIG_AIO)               += aio.o
+obj-$(CONFIG_FS_XIP)		+= xip.o
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
 obj-$(CONFIG_COMPAT)		+= compat.o compat_ioctl.o
 obj-$(CONFIG_BINFMT_AOUT)	+= binfmt_aout.o
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index a5b3a5d..b0eb1d4 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -81,8 +81,10 @@ const struct file_operations ext2_file_operations = {
 #ifdef CONFIG_EXT2_FS_XIP
 const struct file_operations ext2_xip_file_operations = {
 	.llseek		= generic_file_llseek,
-	.read		= xip_file_read,
-	.write		= xip_file_write,
+	.read		= do_sync_read,
+	.write		= do_sync_write,
+	.aio_read	= generic_file_aio_read,
+	.aio_write	= generic_file_aio_write,
 	.unlocked_ioctl = ext2_ioctl,
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ext2_compat_ioctl,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 9b494ab..3d11f1d 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -858,7 +858,11 @@ ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 	struct inode *inode = mapping->host;
 	ssize_t ret;
 
-	ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+	if (IS_XIP(inode))
+		ret = xip_do_io(rw, iocb, inode, iov, offset, nr_segs,
+				ext2_get_block, NULL, DIO_LOCKING);
+	else
+		ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
 				 ext2_get_block);
 	if (ret < 0 && (rw & WRITE))
 		ext2_write_failed(mapping, offset + iov_length(iov, nr_segs));
@@ -888,6 +892,7 @@ const struct address_space_operations ext2_aops = {
 const struct address_space_operations ext2_aops_xip = {
 	.bmap			= ext2_bmap,
 	.get_xip_mem		= ext2_get_xip_mem,
+	.direct_IO		= ext2_direct_IO,
 };
 
 const struct address_space_operations ext2_nobh_aops = {
diff --git a/fs/xip.c b/fs/xip.c
new file mode 100644
index 0000000..0f0f15b
--- /dev/null
+++ b/fs/xip.c
@@ -0,0 +1,156 @@
+/*
+ * fs/xip.c - Execute In Place filesystem code
+ * Copyright (c) 2013 Intel Corporation
+ * Author: Matthew Wilcox <matthew.r.wilcox@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/atomic.h>
+#include <linux/blkdev.h>
+#include <linux/buffer_head.h>
+#include <linux/fs.h>
+#include <linux/genhd.h>
+#include <linux/mutex.h>
+#include <linux/uio.h>
+
+static long xip_get_addr(struct inode *inode, struct buffer_head *bh,
+								void **addr)
+{
+	struct block_device *bdev = bh->b_bdev;
+	const struct block_device_operations *ops = bdev->bd_disk->fops;
+	unsigned long pfn;
+	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
+	return ops->direct_access(bdev, sector, addr, &pfn, bh->b_size);
+}
+
+static ssize_t xip_io(int rw, struct inode *inode, const struct iovec *iov,
+			loff_t start, loff_t end, unsigned nr_segs,
+			get_block_t get_block, struct buffer_head *bh)
+{
+	ssize_t retval = 0;
+	unsigned seg = 0;
+	unsigned len;
+	unsigned copied = 0;
+	loff_t offset = start;
+	loff_t max = start;
+	void *addr;
+	bool hole = false;
+
+	while (offset < end) {
+		void __user *buf = iov[seg].iov_base + copied;
+
+		if (max == offset) {
+			sector_t block = offset >> inode->i_blkbits;
+			long size;
+			memset(bh, 0, sizeof(*bh));
+			bh->b_size = ALIGN(end - offset, PAGE_SIZE);
+			retval = get_block(inode, block, bh, rw == WRITE);
+			if (retval)
+				break;
+			if (buffer_mapped(bh)) {
+				retval = xip_get_addr(inode, bh, &addr);
+				if (retval < 0)
+					break;
+				addr += offset - (block << inode->i_blkbits);
+				hole = false;
+				size = retval;
+			} else {
+				if (rw == WRITE) {
+					retval = -EIO;
+					break;
+				}
+				addr = NULL;
+				hole = true;
+				size = bh->b_size;
+			}
+			max = offset + size;
+		}
+
+		len = min_t(unsigned, iov[seg].iov_len - copied, max - offset);
+
+		if (rw == WRITE)
+			len -= __copy_from_user_nocache(addr, buf, len);
+		else if (!hole)
+			len -= __copy_to_user(buf, addr, len);
+		else
+			len -= __clear_user(buf, len);
+
+		if (!len)
+			break;
+
+		offset += len;
+		copied += len;
+		if (copied == iov[seg].iov_len) {
+			seg++;
+			copied = 0;
+		}
+	}
+
+	return (offset == start) ? retval : offset - start;
+}
+
+/**
+ * xip_do_io - Perform I/O to an XIP file
+ * @rw: READ to read or WRITE to write
+ * @iocb: The control block for this I/O
+ * @inode: The file which the I/O is directed at
+ * @iov: The user addresses to do I/O from or to
+ * @offset: The file offset where the I/O starts
+ * @nr_segs: The length of the iov array
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ * @end_io: A filesystem callback for I/O completion
+ * @flags: See below
+ *
+ * This function uses the same locking scheme as do_blockdev_direct_IO:
+ * If @flags has DIO_LOCKING set, we assume that the i_mutex is held by the
+ * caller for writes.  For reads, we take and release the i_mutex ourselves.
+ * If DIO_LOCKING is not set, the filesystem takes care of its own locking.
+ * As with do_blockdev_direct_IO(), we increment i_dio_count while the I/O
+ * is in progress.
+ */
+ssize_t xip_do_io(int rw, struct kiocb *iocb, struct inode *inode,
+		const struct iovec *iov, loff_t offset, unsigned nr_segs,
+		get_block_t get_block, dio_iodone_t end_io, int flags)
+{
+	struct buffer_head bh;
+	unsigned seg;
+	ssize_t retval = -EINVAL;
+	loff_t end = offset;
+
+	for (seg = 0; seg < nr_segs; seg++)
+		end += iov[seg].iov_len;
+
+	if ((flags & DIO_LOCKING) && (rw == READ)) {
+		struct address_space *mapping = inode->i_mapping;
+		mutex_lock(&inode->i_mutex);
+		retval = filemap_write_and_wait_range(mapping, offset, end - 1);
+		if (retval) {
+			mutex_unlock(&inode->i_mutex);
+			goto out;
+		}
+	}
+
+	/* Protects against truncate */
+	atomic_inc(&inode->i_dio_count);
+
+	retval = xip_io(rw, inode, iov, offset, end, nr_segs, get_block, &bh);
+
+	if ((flags & DIO_LOCKING) && (rw == READ))
+		mutex_unlock(&inode->i_mutex);
+
+	inode_dio_done(inode);
+
+	if ((retval > 0) && end_io)
+		end_io(iocb, offset, retval, bh.b_private);
+ out:
+	return retval;
+}
+EXPORT_SYMBOL_GPL(xip_do_io);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 80cfb42..7cc5bf7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2509,17 +2509,22 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_FS_XIP
-extern ssize_t xip_file_read(struct file *filp, char __user *buf, size_t len,
-			     loff_t *ppos);
 extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
-extern ssize_t xip_file_write(struct file *filp, const char __user *buf,
-			      size_t len, loff_t *ppos);
 extern int xip_truncate_page(struct address_space *mapping, loff_t from);
+ssize_t xip_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
+		loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
 #else
 static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
 {
 	return 0;
 }
+
+static inline ssize_t xip_do_io(int rw, struct kiocb *iocb, struct inode *inode,
+		const struct iovec *iov, loff_t offset, unsigned nr_segs,
+		get_block_t get_block, dio_iodone_t end_io, int flags)
+{
+	return -ENOTTY;
+}
 #endif
 
 #ifdef CONFIG_BLOCK
@@ -2669,6 +2674,11 @@ extern int generic_show_options(struct seq_file *m, struct dentry *root);
 extern void save_mount_options(struct super_block *sb, char *options);
 extern void replace_mount_options(struct super_block *sb, char *options);
 
+static inline bool io_is_direct(struct file *filp)
+{
+	return (filp->f_flags & O_DIRECT) || IS_XIP(file_inode(filp));
+}
+
 static inline ino_t parent_ino(struct dentry *dentry)
 {
 	ino_t res;
diff --git a/mm/filemap.c b/mm/filemap.c
index b7749a9..61a31f0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1417,8 +1417,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
 	if (retval)
 		return retval;
 
-	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
-	if (filp->f_flags & O_DIRECT) {
+	if (io_is_direct(filp)) {
 		loff_t size;
 		struct address_space *mapping;
 		struct inode *inode;
@@ -2470,8 +2469,7 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	if (err)
 		goto out;
 
-	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
-	if (unlikely(file->f_flags & O_DIRECT)) {
+	if (io_is_direct(file)) {
 		loff_t endbyte;
 		ssize_t written_buffered;
 
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index c8d23e9..f7c37a1 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -42,119 +42,6 @@ static struct page *xip_sparse_page(void)
 }
 
 /*
- * This is a file read routine for execute in place files, and uses
- * the mapping->a_ops->get_xip_mem() function for the actual low-level
- * stuff.
- *
- * Note the struct file* is not used at all.  It may be NULL.
- */
-static ssize_t
-do_xip_mapping_read(struct address_space *mapping,
-		    struct file_ra_state *_ra,
-		    struct file *filp,
-		    char __user *buf,
-		    size_t len,
-		    loff_t *ppos)
-{
-	struct inode *inode = mapping->host;
-	pgoff_t index, end_index;
-	unsigned long offset;
-	loff_t isize, pos;
-	size_t copied = 0, error = 0;
-
-	BUG_ON(!mapping->a_ops->get_xip_mem);
-
-	pos = *ppos;
-	index = pos >> PAGE_CACHE_SHIFT;
-	offset = pos & ~PAGE_CACHE_MASK;
-
-	isize = i_size_read(inode);
-	if (!isize)
-		goto out;
-
-	end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
-	do {
-		unsigned long nr, left;
-		void *xip_mem;
-		unsigned long xip_pfn;
-		int zero = 0;
-
-		/* nr is the maximum number of bytes to copy from this page */
-		nr = PAGE_CACHE_SIZE;
-		if (index >= end_index) {
-			if (index > end_index)
-				goto out;
-			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
-			if (nr <= offset) {
-				goto out;
-			}
-		}
-		nr = nr - offset;
-		if (nr > len - copied)
-			nr = len - copied;
-
-		error = mapping->a_ops->get_xip_mem(mapping, index, 0,
-							&xip_mem, &xip_pfn);
-		if (unlikely(error)) {
-			if (error == -ENODATA) {
-				/* sparse */
-				zero = 1;
-			} else
-				goto out;
-		}
-
-		/* If users can be writing to this page using arbitrary
-		 * virtual addresses, take care about potential aliasing
-		 * before reading the page on the kernel side.
-		 */
-		if (mapping_writably_mapped(mapping))
-			/* address based flush */ ;
-
-		/*
-		 * Ok, we have the mem, so now we can copy it to user space...
-		 *
-		 * The actor routine returns how many bytes were actually used..
-		 * NOTE! This may not be the same as how much of a user buffer
-		 * we filled up (we may be padding etc), so we can only update
-		 * "pos" here (the actor routine has to update the user buffer
-		 * pointers and the remaining count).
-		 */
-		if (!zero)
-			left = __copy_to_user(buf+copied, xip_mem+offset, nr);
-		else
-			left = __clear_user(buf + copied, nr);
-
-		if (left) {
-			error = -EFAULT;
-			goto out;
-		}
-
-		copied += (nr - left);
-		offset += (nr - left);
-		index += offset >> PAGE_CACHE_SHIFT;
-		offset &= ~PAGE_CACHE_MASK;
-	} while (copied < len);
-
-out:
-	*ppos = pos + copied;
-	if (filp)
-		file_accessed(filp);
-
-	return (copied ? copied : error);
-}
-
-ssize_t
-xip_file_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
-{
-	if (!access_ok(VERIFY_WRITE, buf, len))
-		return -EFAULT;
-
-	return do_xip_mapping_read(filp->f_mapping, &filp->f_ra, filp,
-			    buf, len, ppos);
-}
-EXPORT_SYMBOL_GPL(xip_file_read);
-
-/*
  * __xip_unmap is invoked from xip_unmap and
  * xip_write
  *
@@ -340,127 +227,6 @@ int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
 }
 EXPORT_SYMBOL_GPL(xip_file_mmap);
 
-static ssize_t
-__xip_file_write(struct file *filp, const char __user *buf,
-		  size_t count, loff_t pos, loff_t *ppos)
-{
-	struct address_space * mapping = filp->f_mapping;
-	const struct address_space_operations *a_ops = mapping->a_ops;
-	struct inode 	*inode = mapping->host;
-	long		status = 0;
-	size_t		bytes;
-	ssize_t		written = 0;
-
-	BUG_ON(!mapping->a_ops->get_xip_mem);
-
-	do {
-		unsigned long index;
-		unsigned long offset;
-		size_t copied;
-		void *xip_mem;
-		unsigned long xip_pfn;
-
-		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
-		index = pos >> PAGE_CACHE_SHIFT;
-		bytes = PAGE_CACHE_SIZE - offset;
-		if (bytes > count)
-			bytes = count;
-
-		status = a_ops->get_xip_mem(mapping, index, 0,
-						&xip_mem, &xip_pfn);
-		if (status == -ENODATA) {
-			/* we allocate a new page unmap it */
-			mutex_lock(&xip_sparse_mutex);
-			status = a_ops->get_xip_mem(mapping, index, 1,
-							&xip_mem, &xip_pfn);
-			mutex_unlock(&xip_sparse_mutex);
-			if (!status)
-				/* unmap page at pgoff from all other vmas */
-				__xip_unmap(mapping, index);
-		}
-
-		if (status)
-			break;
-
-		copied = bytes -
-			__copy_from_user_nocache(xip_mem + offset, buf, bytes);
-
-		if (likely(copied > 0)) {
-			status = copied;
-
-			if (status >= 0) {
-				written += status;
-				count -= status;
-				pos += status;
-				buf += status;
-			}
-		}
-		if (unlikely(copied != bytes))
-			if (status >= 0)
-				status = -EFAULT;
-		if (status < 0)
-			break;
-	} while (count);
-	*ppos = pos;
-	/*
-	 * No need to use i_size_read() here, the i_size
-	 * cannot change under us because we hold i_mutex.
-	 */
-	if (pos > inode->i_size) {
-		i_size_write(inode, pos);
-		mark_inode_dirty(inode);
-	}
-
-	return written ? written : status;
-}
-
-ssize_t
-xip_file_write(struct file *filp, const char __user *buf, size_t len,
-	       loff_t *ppos)
-{
-	struct address_space *mapping = filp->f_mapping;
-	struct inode *inode = mapping->host;
-	size_t count;
-	loff_t pos;
-	ssize_t ret;
-
-	mutex_lock(&inode->i_mutex);
-
-	if (!access_ok(VERIFY_READ, buf, len)) {
-		ret=-EFAULT;
-		goto out_up;
-	}
-
-	pos = *ppos;
-	count = len;
-
-	/* We can write back this queue in page reclaim */
-	current->backing_dev_info = mapping->backing_dev_info;
-
-	ret = generic_write_checks(filp, &pos, &count, S_ISBLK(inode->i_mode));
-	if (ret)
-		goto out_backing;
-	if (count == 0)
-		goto out_backing;
-
-	ret = file_remove_suid(filp);
-	if (ret)
-		goto out_backing;
-
-	ret = file_update_time(filp);
-	if (ret)
-		goto out_backing;
-
-	ret = __xip_file_write (filp, buf, count, pos, ppos);
-
- out_backing:
-	current->backing_dev_info = NULL;
- out_up:
-	mutex_unlock(&inode->i_mutex);
-	return ret;
-}
-EXPORT_SYMBOL_GPL(xip_file_write);
-
 /*
  * truncate a page used for execute in place
  * functionality is analog to block_truncate_page but does use get_xip_mem
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 07/22] Rewrite XIP page fault handling
  2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
                   ` (5 preceding siblings ...)
  2014-01-16  1:24 ` [PATCH v5 06/22] Treat XIP like O_DIRECT Matthew Wilcox
@ 2014-01-16  1:24 ` Matthew Wilcox
  2014-01-16  1:24 ` [PATCH v5 08/22] Change xip_truncate_page to take a get_block parameter Matthew Wilcox
                   ` (21 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-16  1:24 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-ext4; +Cc: Matthew Wilcox

Instead of calling aops->get_xip_mem from the fault handler, allow the
filesystem to pass in a get_block_t that is used to find the appropriate
blocks.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext2/file.c     |  35 ++++++++-
 fs/xip.c           | 167 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h |   4 +-
 mm/filemap_xip.c   | 206 -----------------------------------------------------
 4 files changed, 203 insertions(+), 209 deletions(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index b0eb1d4..9e88388 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -25,6 +25,37 @@
 #include "xattr.h"
 #include "acl.h"
 
+#ifdef CONFIG_EXT2_FS_XIP
+static int ext2_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return xip_fault(vma, vmf, ext2_get_block);
+}
+
+static int ext2_xip_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return xip_mkwrite(vma, vmf, ext2_get_block);
+}
+
+static const struct vm_operations_struct ext2_xip_vm_ops = {
+	.fault		= ext2_xip_fault,
+	.page_mkwrite	= ext2_xip_mkwrite,
+	.remap_pages	= generic_file_remap_pages,
+};
+
+static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	if (!IS_XIP(file_inode(file)))
+		return generic_file_mmap(file, vma);
+
+	file_accessed(file);
+	vma->vm_ops = &ext2_xip_vm_ops;
+	vma->vm_flags |= VM_MIXEDMAP;
+	return 0;
+}
+#else
+#define ext2_file_mmap	generic_file_mmap
+#endif
+
 /*
  * Called when filp is released. This happens when all file descriptors
  * for a single struct file are closed. Note that different open() calls
@@ -70,7 +101,7 @@ const struct file_operations ext2_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ext2_compat_ioctl,
 #endif
-	.mmap		= generic_file_mmap,
+	.mmap		= ext2_file_mmap,
 	.open		= dquot_file_open,
 	.release	= ext2_release_file,
 	.fsync		= ext2_fsync,
@@ -89,7 +120,7 @@ const struct file_operations ext2_xip_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ext2_compat_ioctl,
 #endif
-	.mmap		= xip_file_mmap,
+	.mmap		= ext2_file_mmap,
 	.open		= dquot_file_open,
 	.release	= ext2_release_file,
 	.fsync		= ext2_fsync,
diff --git a/fs/xip.c b/fs/xip.c
index 0f0f15b..bdedd56 100644
--- a/fs/xip.c
+++ b/fs/xip.c
@@ -18,6 +18,8 @@
 #include <linux/buffer_head.h>
 #include <linux/fs.h>
 #include <linux/genhd.h>
+#include <linux/highmem.h>
+#include <linux/mm.h>
 #include <linux/mutex.h>
 #include <linux/uio.h>
 
@@ -31,6 +33,16 @@ static long xip_get_addr(struct inode *inode, struct buffer_head *bh,
 	return ops->direct_access(bdev, sector, addr, &pfn, bh->b_size);
 }
 
+static long xip_get_pfn(struct inode *inode, struct buffer_head *bh,
+							unsigned long *pfn)
+{
+	struct block_device *bdev = bh->b_bdev;
+	const struct block_device_operations *ops = bdev->bd_disk->fops;
+	void *addr;
+	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
+	return ops->direct_access(bdev, sector, &addr, pfn, bh->b_size);
+}
+
 static ssize_t xip_io(int rw, struct inode *inode, const struct iovec *iov,
 			loff_t start, loff_t end, unsigned nr_segs,
 			get_block_t get_block, struct buffer_head *bh)
@@ -154,3 +166,158 @@ ssize_t xip_do_io(int rw, struct kiocb *iocb, struct inode *inode,
 	return retval;
 }
 EXPORT_SYMBOL_GPL(xip_do_io);
+
+/*
+ * The user has performed a load from a hole in the file.  Allocating
+ * a new page in the file would cause excessive storage usage for
+ * workloads with sparse files.  We allocate a page cache page instead.
+ * We'll kick it out of the page cache if it's ever written to,
+ * otherwise it will simply fall out of the page cache under memory
+ * pressure without ever having been dirtied.
+ */
+static int xip_load_hole(struct address_space *mapping, struct vm_fault *vmf)
+{
+	unsigned long size;
+	struct inode *inode = mapping->host;
+	struct page *page = find_or_create_page(mapping, vmf->pgoff,
+						GFP_KERNEL | __GFP_ZERO);
+	if (!page)
+		return VM_FAULT_OOM;
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size) {
+		unlock_page(page);
+		page_cache_release(page);
+		return VM_FAULT_SIGBUS;
+	}
+
+	vmf->page = page;
+	return VM_FAULT_LOCKED;
+}
+
+static void copy_user_bh(struct page *to, struct inode *inode,
+				struct buffer_head *bh, unsigned long vaddr)
+{
+	void *vfrom, *vto;
+	xip_get_addr(inode, bh, &vfrom);	/* XXX: error handling */
+	vto = kmap_atomic(to);
+	copy_user_page(vto, vfrom, vaddr, to);
+	kunmap_atomic(vto);
+}
+
+static int do_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+			get_block_t get_block)
+{
+	struct file *file = vma->vm_file;
+	struct inode *inode = file_inode(file);
+	struct address_space *mapping = file->f_mapping;
+	struct buffer_head bh;
+	unsigned long vaddr = (unsigned long)vmf->virtual_address;
+	sector_t block;
+	pgoff_t size;
+	unsigned long pfn;
+	int error;
+
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size)
+		return VM_FAULT_SIGBUS;
+
+	memset(&bh, 0, sizeof(bh));
+	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - inode->i_blkbits);
+	bh.b_size = PAGE_SIZE;
+	error = get_block(inode, block, &bh, 0);
+	if (error || bh.b_size < PAGE_SIZE)
+		return VM_FAULT_SIGBUS;
+
+	if (!buffer_mapped(&bh) && !vmf->cow_page) {
+		if (vmf->flags & FAULT_FLAG_WRITE) {
+			error = get_block(inode, block, &bh, 1);
+			if (error || bh.b_size < PAGE_SIZE)
+				return VM_FAULT_SIGBUS;
+		} else {
+			return xip_load_hole(mapping, vmf);
+		}
+	}
+
+	/* Recheck i_size under i_mmap_mutex */
+	mutex_lock(&mapping->i_mmap_mutex);
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (unlikely(vmf->pgoff >= size)) {
+		mutex_unlock(&mapping->i_mmap_mutex);
+		return VM_FAULT_SIGBUS;
+	}
+	if (vmf->cow_page) {
+		if (buffer_mapped(&bh))
+			copy_user_bh(vmf->cow_page, inode, &bh, vaddr);
+		else
+			clear_user_highpage(vmf->cow_page, vaddr);
+		return VM_FAULT_COWED;
+	}
+
+	error = xip_get_pfn(inode, &bh, &pfn);
+	if (error > 0)
+		error = vm_insert_mixed(vma, vaddr, pfn);
+	mutex_unlock(&mapping->i_mmap_mutex);
+	if (error == -ENOMEM)
+		return VM_FAULT_OOM;
+	/* -EBUSY is fine, somebody else faulted on the same PTE */
+	if (error != -EBUSY)
+		BUG_ON(error);
+	return VM_FAULT_NOPAGE;
+}
+
+/**
+ * xip_fault - handle a page fault on an XIP file
+ * @vma: The virtual memory area where the fault occurred
+ * @vmf: The description of the fault
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * When a page fault occurs, filesystems may call this helper in their
+ * fault handler for XIP files.
+ */
+int xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+			get_block_t get_block)
+{
+	int result;
+	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+	sb_start_pagefault(sb);
+	file_update_time(vma->vm_file);
+	result = do_xip_fault(vma, vmf, get_block);
+	sb_end_pagefault(sb);
+
+	return result;
+}
+EXPORT_SYMBOL_GPL(xip_fault);
+
+/**
+ * xip_mkwrite - convert a read-only page to read-write in an XIP file
+ * @vma: The virtual memory area where the fault occurred
+ * @vmf: The description of the fault
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * XIP handles reads of holes by adding pages full of zeroes into the
+ * mapping.  If the page is subsequenty written to, we have to allocate
+ * the page on media and free the page that was in the cache.
+ */
+int xip_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
+			get_block_t get_block)
+{
+	int result;
+	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+	sb_start_pagefault(sb);
+	file_update_time(vma->vm_file);
+	result = do_xip_fault(vma, vmf, get_block);
+	sb_end_pagefault(sb);
+
+	if (!(result & VM_FAULT_ERROR)) {
+		struct page *page = vmf->page;
+		unmap_mapping_range(page->mapping,
+					(loff_t)page->index << PAGE_CACHE_SHIFT,
+					PAGE_CACHE_SIZE, 0);
+		delete_from_page_cache(page);
+	}
+
+	return result;
+}
+EXPORT_SYMBOL_GPL(xip_mkwrite);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7cc5bf7..aca9a1c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -48,6 +48,7 @@ struct cred;
 struct swap_info_struct;
 struct seq_file;
 struct workqueue_struct;
+struct vm_fault;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -2509,10 +2510,11 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_FS_XIP
-extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
 extern int xip_truncate_page(struct address_space *mapping, loff_t from);
 ssize_t xip_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
 		loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
+int xip_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+int xip_mkwrite(struct vm_area_struct *, struct vm_fault *, get_block_t);
 #else
 static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
 {
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index f7c37a1..9dd45f3 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -22,212 +22,6 @@
 #include <asm/io.h>
 
 /*
- * We do use our own empty page to avoid interference with other users
- * of ZERO_PAGE(), such as /dev/zero
- */
-static DEFINE_MUTEX(xip_sparse_mutex);
-static seqcount_t xip_sparse_seq = SEQCNT_ZERO(xip_sparse_seq);
-static struct page *__xip_sparse_page;
-
-/* called under xip_sparse_mutex */
-static struct page *xip_sparse_page(void)
-{
-	if (!__xip_sparse_page) {
-		struct page *page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
-
-		if (page)
-			__xip_sparse_page = page;
-	}
-	return __xip_sparse_page;
-}
-
-/*
- * __xip_unmap is invoked from xip_unmap and
- * xip_write
- *
- * This function walks all vmas of the address_space and unmaps the
- * __xip_sparse_page when found at pgoff.
- */
-static void
-__xip_unmap (struct address_space * mapping,
-		     unsigned long pgoff)
-{
-	struct vm_area_struct *vma;
-	struct mm_struct *mm;
-	unsigned long address;
-	pte_t *pte;
-	pte_t pteval;
-	spinlock_t *ptl;
-	struct page *page;
-	unsigned count;
-	int locked = 0;
-
-	count = read_seqcount_begin(&xip_sparse_seq);
-
-	page = __xip_sparse_page;
-	if (!page)
-		return;
-
-retry:
-	mutex_lock(&mapping->i_mmap_mutex);
-	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
-		mm = vma->vm_mm;
-		address = vma->vm_start +
-			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
-		BUG_ON(address < vma->vm_start || address >= vma->vm_end);
-		pte = page_check_address(page, mm, address, &ptl, 1);
-		if (pte) {
-			/* Nuke the page table entry. */
-			flush_cache_page(vma, address, pte_pfn(*pte));
-			pteval = ptep_clear_flush(vma, address, pte);
-			page_remove_rmap(page);
-			dec_mm_counter(mm, MM_FILEPAGES);
-			BUG_ON(pte_dirty(pteval));
-			pte_unmap_unlock(pte, ptl);
-			/* must invalidate_page _before_ freeing the page */
-			mmu_notifier_invalidate_page(mm, address);
-			page_cache_release(page);
-		}
-	}
-	mutex_unlock(&mapping->i_mmap_mutex);
-
-	if (locked) {
-		mutex_unlock(&xip_sparse_mutex);
-	} else if (read_seqcount_retry(&xip_sparse_seq, count)) {
-		mutex_lock(&xip_sparse_mutex);
-		locked = 1;
-		goto retry;
-	}
-}
-
-/*
- * xip_fault() is invoked via the vma operations vector for a
- * mapped memory region to read in file data during a page fault.
- *
- * This function is derived from filemap_fault, but used for execute in place
- */
-static int xip_file_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
-{
-	struct file *file = vma->vm_file;
-	struct address_space *mapping = file->f_mapping;
-	struct inode *inode = mapping->host;
-	pgoff_t size;
-	void *xip_mem;
-	unsigned long xip_pfn;
-	struct page *page;
-	int error;
-
-	/* XXX: are VM_FAULT_ codes OK? */
-again:
-	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-	if (vmf->pgoff >= size)
-		return VM_FAULT_SIGBUS;
-
-	error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0,
-						&xip_mem, &xip_pfn);
-	if (likely(!error))
-		goto found;
-	if (error != -ENODATA)
-		return VM_FAULT_OOM;
-
-	/* sparse block */
-	if ((vma->vm_flags & (VM_WRITE | VM_MAYWRITE)) &&
-	    (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) &&
-	    (!(mapping->host->i_sb->s_flags & MS_RDONLY))) {
-		int err;
-
-		/* maybe shared writable, allocate new block */
-		mutex_lock(&xip_sparse_mutex);
-		error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 1,
-							&xip_mem, &xip_pfn);
-		mutex_unlock(&xip_sparse_mutex);
-		if (error)
-			return VM_FAULT_SIGBUS;
-		/* unmap sparse mappings at pgoff from all other vmas */
-		__xip_unmap(mapping, vmf->pgoff);
-
-found:
-		/* We must recheck i_size under i_mmap_mutex */
-		mutex_lock(&mapping->i_mmap_mutex);
-		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
-							PAGE_CACHE_SHIFT;
-		if (unlikely(vmf->pgoff >= size)) {
-			mutex_unlock(&mapping->i_mmap_mutex);
-			return VM_FAULT_SIGBUS;
-		}
-		err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
-							xip_pfn);
-		mutex_unlock(&mapping->i_mmap_mutex);
-		if (err == -ENOMEM)
-			return VM_FAULT_OOM;
-		/*
-		 * err == -EBUSY is fine, we've raced against another thread
-		 * that faulted-in the same page
-		 */
-		if (err != -EBUSY)
-			BUG_ON(err);
-		return VM_FAULT_NOPAGE;
-	} else {
-		int err, ret = VM_FAULT_OOM;
-
-		mutex_lock(&xip_sparse_mutex);
-		write_seqcount_begin(&xip_sparse_seq);
-		error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0,
-							&xip_mem, &xip_pfn);
-		if (unlikely(!error)) {
-			write_seqcount_end(&xip_sparse_seq);
-			mutex_unlock(&xip_sparse_mutex);
-			goto again;
-		}
-		if (error != -ENODATA)
-			goto out;
-
-		/* We must recheck i_size under i_mmap_mutex */
-		mutex_lock(&mapping->i_mmap_mutex);
-		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
-							PAGE_CACHE_SHIFT;
-		if (unlikely(vmf->pgoff >= size)) {
-			ret = VM_FAULT_SIGBUS;
-			goto unlock;
-		}
-		/* not shared and writable, use xip_sparse_page() */
-		page = xip_sparse_page();
-		if (!page)
-			goto unlock;
-		err = vm_insert_page(vma, (unsigned long)vmf->virtual_address,
-							page);
-		if (err == -ENOMEM)
-			goto unlock;
-
-		ret = VM_FAULT_NOPAGE;
-unlock:
-		mutex_unlock(&mapping->i_mmap_mutex);
-out:
-		write_seqcount_end(&xip_sparse_seq);
-		mutex_unlock(&xip_sparse_mutex);
-
-		return ret;
-	}
-}
-
-static const struct vm_operations_struct xip_file_vm_ops = {
-	.fault	= xip_file_fault,
-	.page_mkwrite	= filemap_page_mkwrite,
-	.remap_pages = generic_file_remap_pages,
-};
-
-int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
-{
-	BUG_ON(!file->f_mapping->a_ops->get_xip_mem);
-
-	file_accessed(file);
-	vma->vm_ops = &xip_file_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP;
-	return 0;
-}
-EXPORT_SYMBOL_GPL(xip_file_mmap);
-
-/*
  * truncate a page used for execute in place
  * functionality is analog to block_truncate_page but does use get_xip_mem
  * to get the page instead of page cache
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 08/22] Change xip_truncate_page to take a get_block parameter
  2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
                   ` (6 preceding siblings ...)
  2014-01-16  1:24 ` [PATCH v5 07/22] Rewrite XIP page fault handling Matthew Wilcox
@ 2014-01-16  1:24 ` Matthew Wilcox
  2014-01-16  1:24 ` [PATCH v5 09/22] Remove mm/filemap_xip.c Matthew Wilcox
                   ` (20 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-16  1:24 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-ext4; +Cc: Matthew Wilcox

Just like nobh_truncate_page() and block_truncate_page()

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext2/inode.c    |  2 +-
 fs/xip.c           | 44 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h |  4 ++--
 mm/filemap_xip.c   | 40 ----------------------------------------
 4 files changed, 47 insertions(+), 43 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 3d11f1d..c531700 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1207,7 +1207,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
 	inode_dio_wait(inode);
 
 	if (IS_XIP(inode))
-		error = xip_truncate_page(inode->i_mapping, newsize);
+		error = xip_truncate_page(inode, newsize, ext2_get_block);
 	else if (test_opt(inode->i_sb, NOBH))
 		error = nobh_truncate_page(inode->i_mapping,
 				newsize, ext2_get_block);
diff --git a/fs/xip.c b/fs/xip.c
index bdedd56..aacb6a8 100644
--- a/fs/xip.c
+++ b/fs/xip.c
@@ -321,3 +321,47 @@ int xip_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 	return result;
 }
 EXPORT_SYMBOL_GPL(xip_mkwrite);
+
+/**
+ * xip_truncate_page - handle a partial page being truncated in an XIP file
+ * @inode: The file being truncated
+ * @from: The file offset that is being truncated to
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * Similar to block_truncate_page(), this function can be called by a
+ * filesystem when it is truncating an XIP file to handle the partial page.
+ *
+ * We work in terms of PAGE_CACHE_SIZE here for commonality with
+ * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
+ * took care of disposing of the unnecessary blocks.  Even if the filesystem
+ * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
+ * since the file might be mmaped.
+ */
+int xip_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+{
+	struct buffer_head bh;
+	pgoff_t index = from >> PAGE_CACHE_SHIFT;
+	unsigned offset = from & (PAGE_CACHE_SIZE-1);
+	unsigned length = PAGE_CACHE_ALIGN(from) - from;
+	int err;
+
+	/* Block boundary? Nothing to do */
+	if (!length)
+		return 0;
+
+	memset(&bh, 0, sizeof(bh));
+	bh.b_size = PAGE_CACHE_SIZE;
+	err = get_block(inode, index, &bh, 0);
+	if (err < 0)
+		return err;
+	if (buffer_mapped(&bh)) {
+		void *addr;
+		err = xip_get_addr(inode, &bh, &addr);
+		if (err)
+			return err;
+		memset(addr + offset, 0, length);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(xip_truncate_page);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index aca9a1c..c77efa3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2510,13 +2510,13 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_FS_XIP
-extern int xip_truncate_page(struct address_space *mapping, loff_t from);
+int xip_truncate_page(struct inode *, loff_t from, get_block_t);
 ssize_t xip_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
 		loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
 int xip_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int xip_mkwrite(struct vm_area_struct *, struct vm_fault *, get_block_t);
 #else
-static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
+static inline int xip_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
 {
 	return 0;
 }
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 9dd45f3..6316578 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -21,43 +21,3 @@
 #include <asm/tlbflush.h>
 #include <asm/io.h>
 
-/*
- * truncate a page used for execute in place
- * functionality is analog to block_truncate_page but does use get_xip_mem
- * to get the page instead of page cache
- */
-int
-xip_truncate_page(struct address_space *mapping, loff_t from)
-{
-	pgoff_t index = from >> PAGE_CACHE_SHIFT;
-	unsigned offset = from & (PAGE_CACHE_SIZE-1);
-	unsigned blocksize;
-	unsigned length;
-	void *xip_mem;
-	unsigned long xip_pfn;
-	int err;
-
-	BUG_ON(!mapping->a_ops->get_xip_mem);
-
-	blocksize = 1 << mapping->host->i_blkbits;
-	length = offset & (blocksize - 1);
-
-	/* Block boundary? Nothing to do */
-	if (!length)
-		return 0;
-
-	length = blocksize - length;
-
-	err = mapping->a_ops->get_xip_mem(mapping, index, 0,
-						&xip_mem, &xip_pfn);
-	if (unlikely(err)) {
-		if (err == -ENODATA)
-			/* Hole? No need to truncate */
-			return 0;
-		else
-			return err;
-	}
-	memset(xip_mem + offset, 0, length);
-	return 0;
-}
-EXPORT_SYMBOL_GPL(xip_truncate_page);
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 09/22] Remove mm/filemap_xip.c
  2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
                   ` (7 preceding siblings ...)
  2014-01-16  1:24 ` [PATCH v5 08/22] Change xip_truncate_page to take a get_block parameter Matthew Wilcox
@ 2014-01-16  1:24 ` Matthew Wilcox
  2014-01-16  1:24 ` [PATCH v5 10/22] Remove get_xip_mem Matthew Wilcox
                   ` (19 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-16  1:24 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-ext4; +Cc: Matthew Wilcox

It is now empty as all of its contents have been replaced by fs/xip.c

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 mm/Makefile      |  1 -
 mm/filemap_xip.c | 23 -----------------------
 2 files changed, 24 deletions(-)
 delete mode 100644 mm/filemap_xip.c

diff --git a/mm/Makefile b/mm/Makefile
index 305d10a..c019f5e 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -47,7 +47,6 @@ obj-$(CONFIG_SLUB) += slub.o
 obj-$(CONFIG_KMEMCHECK) += kmemcheck.o
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
-obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
deleted file mode 100644
index 6316578..0000000
--- a/mm/filemap_xip.c
+++ /dev/null
@@ -1,23 +0,0 @@
-/*
- *	linux/mm/filemap_xip.c
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte <cotte@de.ibm.com>
- *
- * derived from linux/mm/filemap.c - Copyright (C) Linus Torvalds
- *
- */
-
-#include <linux/fs.h>
-#include <linux/pagemap.h>
-#include <linux/export.h>
-#include <linux/uio.h>
-#include <linux/rmap.h>
-#include <linux/mmu_notifier.h>
-#include <linux/sched.h>
-#include <linux/seqlock.h>
-#include <linux/mutex.h>
-#include <linux/gfp.h>
-#include <asm/tlbflush.h>
-#include <asm/io.h>
-
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 10/22] Remove get_xip_mem
  2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
                   ` (8 preceding siblings ...)
  2014-01-16  1:24 ` [PATCH v5 09/22] Remove mm/filemap_xip.c Matthew Wilcox
@ 2014-01-16  1:24 ` Matthew Wilcox
  2014-01-16  1:46   ` Randy Dunlap
  2014-01-16  1:24 ` [PATCH v5 11/22] Replace ext2_clear_xip_target with xip_clear_blocks Matthew Wilcox
                   ` (18 subsequent siblings)
  28 siblings, 1 reply; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-16  1:24 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-ext4; +Cc: Matthew Wilcox

All callers of get_xip_mem() are now gone.  Remove checks for it,
initialisers of it, documentation of it and the only implementation of it.

Add documentation for the new way of writing an XIP filesystem.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 Documentation/filesystems/Locking |   3 -
 Documentation/filesystems/xip.txt | 116 +++++++++++++++++++++-----------------
 fs/exofs/inode.c                  |   1 -
 fs/ext2/inode.c                   |   1 -
 fs/ext2/xip.c                     |  37 ------------
 fs/ext2/xip.h                     |   3 -
 fs/open.c                         |   5 +-
 include/linux/fs.h                |   2 -
 mm/fadvise.c                      |   6 +-
 mm/madvise.c                      |   2 +-
 10 files changed, 70 insertions(+), 106 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index fe7afe2..75da4b6 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -194,8 +194,6 @@ prototypes:
 	void (*freepage)(struct page *);
 	int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
 			loff_t offset, unsigned long nr_segs);
-	int (*get_xip_mem)(struct address_space *, pgoff_t, int, void **,
-				unsigned long *);
 	int (*migratepage)(struct address_space *, struct page *, struct page *);
 	int (*launder_page)(struct page *);
 	int (*is_partially_uptodate)(struct page *, read_descriptor_t *, unsigned long);
@@ -220,7 +218,6 @@ invalidatepage:		yes
 releasepage:		yes
 freepage:		yes
 direct_IO:
-get_xip_mem:					maybe
 migratepage:		yes (both)
 launder_page:		yes
 is_partially_uptodate:	yes
diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
index b62eabf..520e73a 100644
--- a/Documentation/filesystems/xip.txt
+++ b/Documentation/filesystems/xip.txt
@@ -3,69 +3,81 @@ Execute-in-place for file mappings
 
 Motivation
 ----------
-File mappings are performed by mapping page cache pages to userspace. In
-addition, read&write type file operations also transfer data from/to the page
-cache.
-
-For memory backed storage devices that use the block device interface, the page
-cache pages are in fact copies of the original storage. Various approaches
-exist to work around the need for an extra copy. The ramdisk driver for example
-does read the data into the page cache, keeps a reference, and discards the
-original data behind later on.
-
-Execute-in-place solves this issue the other way around: instead of keeping
-data in the page cache, the need to have a page cache copy is eliminated
-completely. With execute-in-place, read&write type operations are performed
-directly from/to the memory backed storage device. For file mappings, the
-storage device itself is mapped directly into userspace.
-
-This implementation was initially written for shared memory segments between
-different virtual machines on s390 hardware to allow multiple machines to
-share the same binaries and libraries.
-
-Implementation
---------------
-Execute-in-place is implemented in three steps: block device operation,
-address space operation, and file operations.
-
-A block device operation named direct_access is used to translate the
-block device sector number to a page frame number (pfn) that identifies
-the physical page for the memory.  It also returns a kernel virtual
-address that can be used to access the memory.
+
+File mappings are usually performed by mapping page cache pages to
+userspace.  In addition, read & write file operations also transfer data
+between the page cache and storage.
+
+For memory backed storage devices that use the block device interface,
+the page cache pages are just copies of the original storage.  The
+execute-in-place code removes the extra copy by performing reads and
+writes directly on the memory backed storage device.  For file mappings,
+the storage device itself is mapped directly into userspace.
+
+
+Implementation Tips for Block Driver Writers
+--------------------------------------------
+
+To support XIP in your block driver, implement the 'direct_access'
+block device operation.  It is used to translate the sector number
+(expressed in units of 512-byte sectors) to a page frame number (pfn)
+that identifies the physical page for the memory.  It also returns a
+kernel virtual address that can be used to access the memory.
 
 The direct_access method takes a 'size' parameter that indicates the
 number of bytes being requested.  The function should return the number
 of bytes that it can provide, although it must not exceed the number of
 bytes requested.  It may also return a negative errno if an error occurs.
 
-The block device operation is optional, these block devices support it as of
-today:
+In order to support this method, the storage must be byte-accessable by
+the CPU at all times.  If your device uses paging techniques to expose
+a large amount of memory through a smaller window, then you cannot
+implement direct_access.  Equally, if your device can occasionally
+stall the CPU for an extended period, you should also not attempt to
+implement direct_access.
+
+These block devices may be used for inspiration:
+- axonram: Axon DDR2 device driver
+- brd: RAM backed block device driver
 - dcssblk: s390 dcss block device driver
 
-An address space operation named get_xip_mem is used to retrieve references
-to a page frame number and a kernel address. To obtain these values a reference
-to an address_space is provided. This function assigns values to the kmem and
-pfn parameters. The third argument indicates whether the function should allocate
-blocks if needed.
 
-This address space operation is mutually exclusive with readpage&writepage that
-do page cache read/write operations.
-The following filesystems support it as of today:
-- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+Implementation Tips for Filesystem Writers
+------------------------------------------
+
+Filesystem support consists of
+- adding support to mark inodes as being XIP by setting the S_XIP flag in
+  i_flags
+- implementing the direct_IO address space operation, and calling
+  xip_do_io() instead of blockdev_direct_IO() if S_XIP is set
+- implementing an mmap file operation for XIP files which sets the
+  VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include handlers
+  for fault and page_mkwrite (which should probably call xip_fault() and
+  xip_mkwrite(), passing the appropriate get_block() callback)
+- calling xip_truncate_page() instead of block_truncate_page() for XIP files
+- ensuring that there is sufficient locking between reads, writes,
+  truncates and page faults
+
+The get_block() callback passed to xip_do_io(), xip_fault(), xip_mkwrite()
+and xip_truncate_page() must not return uninitialised extents.  It must zero
+any blocks that it returns, and it must ensure that simultaneous calls to
+get_block() (for example by a page-fault racing with a read() or a write())
+work correctly.
 
-A set of file operations that do utilize get_xip_page can be found in
-mm/filemap_xip.c . The following file operation implementations are provided:
-- aio_read/aio_write
-- readv/writev
-- sendfile
+These filesystems may be used for inspiration:
+- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
 
-The generic file operations do_sync_read/do_sync_write can be used to implement
-classic synchronous IO calls.
 
 Shortcomings
 ------------
-This implementation is limited to storage devices that are cpu addressable at
-all times (no highmem or such). It works well on rom/ram, but enhancements are
-needed to make it work with flash in read+write mode.
-Putting the Linux kernel and/or its modules on a xip filesystem does not mean
-they are not copied.
+
+Even if the kernel or its modules are stored on an filesystem that supports
+XIP on a block device that supports XIP, they will still be copied into RAM.
+
+Calling get_user_pages() on a range of user memory that has been mmaped
+from an XIP file will fail as there are no 'struct page' to describe
+those pages.  This problem is being worked on.  That means that O_DIRECT
+reads/writes to those memory ranges from a non-XIP file will fail (note
+that O_DIRECT reads/writes _of an XIP file_ do work, it is the memory
+that is being accessed that is key here).  Other things that will not
+work include RDMA, sendfile() and splice().
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index a52a5d2..a96ce89 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -977,7 +977,6 @@ const struct address_space_operations exofs_aops = {
 	.direct_IO	= NULL, /* TODO: Should be trivial to do */
 
 	/* With these NULL has special meaning or default is not exported */
-	.get_xip_mem	= NULL,
 	.migratepage	= NULL,
 	.launder_page	= NULL,
 	.is_partially_uptodate = NULL,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index c531700..57726ab 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -891,7 +891,6 @@ const struct address_space_operations ext2_aops = {
 
 const struct address_space_operations ext2_aops_xip = {
 	.bmap			= ext2_bmap,
-	.get_xip_mem		= ext2_get_xip_mem,
 	.direct_IO		= ext2_direct_IO,
 };
 
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index fa40091..ca745ff 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -22,27 +22,6 @@ static inline long __inode_direct_access(struct inode *inode, sector_t block,
 	return ops->direct_access(bdev, sector, kaddr, pfn, size);
 }
 
-static inline int
-__ext2_get_block(struct inode *inode, pgoff_t pgoff, int create,
-		   sector_t *result)
-{
-	struct buffer_head tmp;
-	int rc;
-
-	memset(&tmp, 0, sizeof(struct buffer_head));
-	tmp.b_size = 1 << inode->i_blkbits;
-	rc = ext2_get_block(inode, pgoff, &tmp, create);
-	*result = tmp.b_blocknr;
-
-	/* did we get a sparse block (hole in the file)? */
-	if (!tmp.b_blocknr && !rc) {
-		BUG_ON(create);
-		rc = -ENODATA;
-	}
-
-	return rc;
-}
-
 int
 ext2_clear_xip_target(struct inode *inode, sector_t block)
 {
@@ -69,19 +48,3 @@ void ext2_xip_verify_sb(struct super_block *sb)
 			     "not supported by bdev");
 	}
 }
-
-int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
-				void **kmem, unsigned long *pfn)
-{
-	long rc;
-	sector_t block;
-
-	/* first, retrieve the sector number */
-	rc = __ext2_get_block(mapping->host, pgoff, create, &block);
-	if (rc)
-		return rc;
-
-	/* retrieve address of the target data */
-	rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
-	return (rc < 0) ? rc : 0;
-}
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 29be737..0fa8b7f 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -14,11 +14,8 @@ static inline int ext2_use_xip (struct super_block *sb)
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
 	return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
 }
-int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
-				void **, unsigned long *);
 #else
 #define ext2_xip_verify_sb(sb)			do { } while (0)
 #define ext2_use_xip(sb)			0
 #define ext2_clear_xip_target(inode, chain)	0
-#define ext2_get_xip_mem			NULL
 #endif
diff --git a/fs/open.c b/fs/open.c
index 4b3e1ed..4b16abe 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -665,11 +665,8 @@ int open_check_o_direct(struct file *f)
 {
 	/* NB: we're sure to have correct a_ops only after f_op->open */
 	if (f->f_flags & O_DIRECT) {
-		if (!f->f_mapping->a_ops ||
-		    ((!f->f_mapping->a_ops->direct_IO) &&
-		    (!f->f_mapping->a_ops->get_xip_mem))) {
+		if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO)
 			return -EINVAL;
-		}
 	}
 	return 0;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c77efa3..7c10319 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -372,8 +372,6 @@ struct address_space_operations {
 	void (*freepage)(struct page *);
 	ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
 			loff_t offset, unsigned long nr_segs);
-	int (*get_xip_mem)(struct address_space *, pgoff_t, int,
-						void **, unsigned long *);
 	/*
 	 * migrate the contents of a page to the specified target. If
 	 * migrate_mode is MIGRATE_ASYNC, it must not block.
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 3bcfd81..30d03bf 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -28,6 +28,7 @@
 SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 {
 	struct fd f = fdget(fd);
+	struct inode *inode;
 	struct address_space *mapping;
 	struct backing_dev_info *bdi;
 	loff_t endbyte;			/* inclusive */
@@ -39,7 +40,8 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 	if (!f.file)
 		return -EBADF;
 
-	if (S_ISFIFO(file_inode(f.file)->i_mode)) {
+	inode = file_inode(f.file);
+	if (S_ISFIFO(inode->i_mode)) {
 		ret = -ESPIPE;
 		goto out;
 	}
@@ -50,7 +52,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 		goto out;
 	}
 
-	if (mapping->a_ops->get_xip_mem) {
+	if (IS_XIP(inode)) {
 		switch (advice) {
 		case POSIX_FADV_NORMAL:
 		case POSIX_FADV_RANDOM:
diff --git a/mm/madvise.c b/mm/madvise.c
index 539eeb9..9b487ca 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -236,7 +236,7 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	if (!file)
 		return -EBADF;
 
-	if (file->f_mapping->a_ops->get_xip_mem) {
+	if (IS_XIP(file_inode(file))) {
 		/* no bad return value, but ignore advice */
 		return 0;
 	}
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 11/22] Replace ext2_clear_xip_target with xip_clear_blocks
  2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
                   ` (9 preceding siblings ...)
  2014-01-16  1:24 ` [PATCH v5 10/22] Remove get_xip_mem Matthew Wilcox
@ 2014-01-16  1:24 ` Matthew Wilcox
  2014-01-16  1:24 ` [PATCH v5 12/22] ext2: Remove ext2_xip_verify_sb() Matthew Wilcox
                   ` (17 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-16  1:24 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-ext4; +Cc: Matthew Wilcox

This is practically generic code; other filesystems will want to call
it from other places, but there's nothing ext2-specific about it.

Make it a little more generic by allowing it to take a count of the number
of bytes to zero rather than fixing it to a single page.  Thanks to Dave
Hansen for suggesting that I need to call cond_resched() if zeroing more
than one page.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext2/inode.c    |  8 +++++---
 fs/ext2/xip.c      | 23 -----------------------
 fs/ext2/xip.h      |  3 ---
 fs/xip.c           | 34 ++++++++++++++++++++++++++++++++++
 include/linux/fs.h |  6 ++++++
 5 files changed, 45 insertions(+), 29 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 57726ab..946ed65 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -733,10 +733,12 @@ static int ext2_get_blocks(struct inode *inode,
 
 	if (IS_XIP(inode)) {
 		/*
-		 * we need to clear the block
+		 * block must be initialised before we put it in the tree
+		 * so that it's not found by another thread before it's
+		 * initialised
 		 */
-		err = ext2_clear_xip_target (inode,
-			le32_to_cpu(chain[depth-1].key));
+		err = xip_clear_blocks(inode, le32_to_cpu(chain[depth-1].key),
+						count << inode->i_blkbits);
 		if (err) {
 			mutex_unlock(&ei->truncate_mutex);
 			goto cleanup;
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index ca745ff..132d4da 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,29 +13,6 @@
 #include "ext2.h"
 #include "xip.h"
 
-static inline long __inode_direct_access(struct inode *inode, sector_t block,
-				void **kaddr, unsigned long *pfn, long size)
-{
-	struct block_device *bdev = inode->i_sb->s_bdev;
-	const struct block_device_operations *ops = bdev->bd_disk->fops;
-	sector_t sector = block * (PAGE_SIZE / 512);
-	return ops->direct_access(bdev, sector, kaddr, pfn, size);
-}
-
-int
-ext2_clear_xip_target(struct inode *inode, sector_t block)
-{
-	void *kaddr;
-	unsigned long pfn;
-	long size;
-
-	size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
-	if (size < 0)
-		return size;
-	clear_page(kaddr);
-	return 0;
-}
-
 void ext2_xip_verify_sb(struct super_block *sb)
 {
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 0fa8b7f..e7b9f0a 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -7,8 +7,6 @@
 
 #ifdef CONFIG_EXT2_FS_XIP
 extern void ext2_xip_verify_sb (struct super_block *);
-extern int ext2_clear_xip_target (struct inode *, sector_t);
-
 static inline int ext2_use_xip (struct super_block *sb)
 {
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
@@ -17,5 +15,4 @@ static inline int ext2_use_xip (struct super_block *sb)
 #else
 #define ext2_xip_verify_sb(sb)			do { } while (0)
 #define ext2_use_xip(sb)			0
-#define ext2_clear_xip_target(inode, chain)	0
 #endif
diff --git a/fs/xip.c b/fs/xip.c
index aacb6a8..3f5f081 100644
--- a/fs/xip.c
+++ b/fs/xip.c
@@ -21,8 +21,42 @@
 #include <linux/highmem.h>
 #include <linux/mm.h>
 #include <linux/mutex.h>
+#include <linux/sched.h>
 #include <linux/uio.h>
 
+int xip_clear_blocks(struct inode *inode, sector_t block, long size)
+{
+	struct block_device *bdev = inode->i_sb->s_bdev;
+	const struct block_device_operations *ops = bdev->bd_disk->fops;
+	sector_t sector = block << (inode->i_blkbits - 9);
+	unsigned long pfn;
+
+	might_sleep();
+	do {
+		void *addr;
+		long count = ops->direct_access(bdev, sector, &addr, &pfn,
+									size);
+		if (count < 0)
+			return count;
+		while (count >= PAGE_SIZE) {
+			clear_page(addr);
+			addr += PAGE_SIZE;
+			size -= PAGE_SIZE;
+			count -= PAGE_SIZE;
+			sector += PAGE_SIZE / 512;
+			cond_resched();
+		}
+		if (count > 0) {
+			memset(addr, 0, count);
+			sector += count / 512;
+			size -= count;
+		}
+	} while (size);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(xip_clear_blocks);
+
 static long xip_get_addr(struct inode *inode, struct buffer_head *bh,
 								void **addr)
 {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7c10319..c93671a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2508,12 +2508,18 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_FS_XIP
+int xip_clear_blocks(struct inode *, sector_t block, long size);
 int xip_truncate_page(struct inode *, loff_t from, get_block_t);
 ssize_t xip_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
 		loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
 int xip_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int xip_mkwrite(struct vm_area_struct *, struct vm_fault *, get_block_t);
 #else
+static inline int xip_clear_blocks(struct inode *i, sector_t blk, long sz)
+{
+	return 0;
+}
+
 static inline int xip_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
 {
 	return 0;
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 12/22] ext2: Remove ext2_xip_verify_sb()
  2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
                   ` (10 preceding siblings ...)
  2014-01-16  1:24 ` [PATCH v5 11/22] Replace ext2_clear_xip_target with xip_clear_blocks Matthew Wilcox
@ 2014-01-16  1:24 ` Matthew Wilcox
  2014-01-16  1:24 ` [PATCH v5 13/22] ext2: Remove ext2_use_xip Matthew Wilcox
                   ` (16 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-16  1:24 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-ext4; +Cc: Matthew Wilcox

Jan Kara pointed out that calling ext2_xip_verify_sb() in ext2_remount()
doesn't make sense, since changing the XIP option on remount isn't
allowed.  It also doesn't make sense to re-check whether blocksize is
supported since it can't change between mounts.

Replace the call to ext2_xip_verify_sb() in ext2_fill_super() with the
equivalent check and delete the definition.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext2/super.c | 33 ++++++++++++---------------------
 fs/ext2/xip.c   | 12 ------------
 fs/ext2/xip.h   |  2 --
 3 files changed, 12 insertions(+), 35 deletions(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 2885349..caf3f62 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -868,9 +868,6 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 		((EXT2_SB(sb)->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ?
 		 MS_POSIXACL : 0);
 
-	ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
-				    EXT2_MOUNT_XIP if not */
-
 	if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV &&
 	    (EXT2_HAS_COMPAT_FEATURE(sb, ~0U) ||
 	     EXT2_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
@@ -900,11 +897,17 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 
 	blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
 
-	if (ext2_use_xip(sb) && blocksize != PAGE_SIZE) {
-		if (!silent)
+	if (sbi->s_mount_opt & EXT2_MOUNT_XIP) {
+		if (blocksize != PAGE_SIZE) {
 			ext2_msg(sb, KERN_ERR,
-				"error: unsupported blocksize for xip");
-		goto failed_mount;
+					"error: unsupported blocksize for xip");
+			goto failed_mount;
+		}
+		if (!sb->s_bdev->bd_disk->fops->direct_access) {
+			ext2_msg(sb, KERN_ERR,
+					"error: device does not support xip");
+			goto failed_mount;
+		}
 	}
 
 	/* If the blocksize doesn't match, re-read the thing.. */
@@ -1249,7 +1252,6 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
 {
 	struct ext2_sb_info * sbi = EXT2_SB(sb);
 	struct ext2_super_block * es;
-	unsigned long old_mount_opt = sbi->s_mount_opt;
 	struct ext2_mount_options old_opts;
 	unsigned long old_sb_flags;
 	int err;
@@ -1273,22 +1275,11 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
 	sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
 		((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);
 
-	ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
-				    EXT2_MOUNT_XIP if not */
-
-	if ((ext2_use_xip(sb)) && (sb->s_blocksize != PAGE_SIZE)) {
-		ext2_msg(sb, KERN_WARNING,
-			"warning: unsupported blocksize for xip");
-		err = -EINVAL;
-		goto restore_opts;
-	}
-
 	es = sbi->s_es;
-	if ((sbi->s_mount_opt ^ old_mount_opt) & EXT2_MOUNT_XIP) {
+	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) {
 		ext2_msg(sb, KERN_WARNING, "warning: refusing change of "
 			 "xip flag with busy inodes while remounting");
-		sbi->s_mount_opt &= ~EXT2_MOUNT_XIP;
-		sbi->s_mount_opt |= old_mount_opt & EXT2_MOUNT_XIP;
+		sbi->s_mount_opt ^= EXT2_MOUNT_XIP;
 	}
 	if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) {
 		spin_unlock(&sbi->s_lock);
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index 132d4da..66ca113 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,15 +13,3 @@
 #include "ext2.h"
 #include "xip.h"
 
-void ext2_xip_verify_sb(struct super_block *sb)
-{
-	struct ext2_sb_info *sbi = EXT2_SB(sb);
-
-	if ((sbi->s_mount_opt & EXT2_MOUNT_XIP) &&
-	    !sb->s_bdev->bd_disk->fops->direct_access) {
-		sbi->s_mount_opt &= (~EXT2_MOUNT_XIP);
-		ext2_msg(sb, KERN_WARNING,
-			     "warning: ignoring xip option - "
-			     "not supported by bdev");
-	}
-}
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index e7b9f0a..87eeb04 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -6,13 +6,11 @@
  */
 
 #ifdef CONFIG_EXT2_FS_XIP
-extern void ext2_xip_verify_sb (struct super_block *);
 static inline int ext2_use_xip (struct super_block *sb)
 {
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
 	return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
 }
 #else
-#define ext2_xip_verify_sb(sb)			do { } while (0)
 #define ext2_use_xip(sb)			0
 #endif
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 13/22] ext2: Remove ext2_use_xip
  2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
                   ` (11 preceding siblings ...)
  2014-01-16  1:24 ` [PATCH v5 12/22] ext2: Remove ext2_xip_verify_sb() Matthew Wilcox
@ 2014-01-16  1:24 ` Matthew Wilcox
  2014-01-16  1:24 ` [PATCH v5 14/22] ext2: Remove xip.c and xip.h Matthew Wilcox
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-16  1:24 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-ext4; +Cc: Matthew Wilcox

Replace ext2_use_xip() with test_opt(XIP) which expands to the same code

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext2/inode.c | 2 +-
 fs/ext2/namei.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 946ed65..9d6f2e1 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1393,7 +1393,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext2_file_inode_operations;
-		if (ext2_use_xip(inode->i_sb)) {
+		if (test_opt(inode->i_sb, XIP)) {
 			inode->i_mapping->a_ops = &ext2_aops_xip;
 			inode->i_fop = &ext2_xip_file_operations;
 		} else if (test_opt(inode->i_sb, NOBH)) {
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 256dd5f..0a7697a 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (ext2_use_xip(inode->i_sb)) {
+	if (test_opt(inode->i_sb, XIP)) {
 		inode->i_mapping->a_ops = &ext2_aops_xip;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
@@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (ext2_use_xip(inode->i_sb)) {
+	if (test_opt(inode->i_sb, XIP)) {
 		inode->i_mapping->a_ops = &ext2_aops_xip;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 14/22] ext2: Remove xip.c and xip.h
  2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
                   ` (12 preceding siblings ...)
  2014-01-16  1:24 ` [PATCH v5 13/22] ext2: Remove ext2_use_xip Matthew Wilcox
@ 2014-01-16  1:24 ` Matthew Wilcox
  2014-01-16  1:24 ` [PATCH v5 15/22] Remove CONFIG_EXT2_FS_XIP Matthew Wilcox
                   ` (14 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-16  1:24 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-ext4; +Cc: Matthew Wilcox

These files are now empty, so delete them

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext2/Makefile |  1 -
 fs/ext2/inode.c  |  1 -
 fs/ext2/namei.c  |  1 -
 fs/ext2/super.c  |  1 -
 fs/ext2/xip.c    | 15 ---------------
 fs/ext2/xip.h    | 16 ----------------
 6 files changed, 35 deletions(-)
 delete mode 100644 fs/ext2/xip.c
 delete mode 100644 fs/ext2/xip.h

diff --git a/fs/ext2/Makefile b/fs/ext2/Makefile
index f42af45..445b0e9 100644
--- a/fs/ext2/Makefile
+++ b/fs/ext2/Makefile
@@ -10,4 +10,3 @@ ext2-y := balloc.o dir.o file.o ialloc.o inode.o \
 ext2-$(CONFIG_EXT2_FS_XATTR)	 += xattr.o xattr_user.o xattr_trusted.o
 ext2-$(CONFIG_EXT2_FS_POSIX_ACL) += acl.o
 ext2-$(CONFIG_EXT2_FS_SECURITY)	 += xattr_security.o
-ext2-$(CONFIG_EXT2_FS_XIP)	 += xip.o
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 9d6f2e1..af0fa94 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -34,7 +34,6 @@
 #include <linux/aio.h>
 #include "ext2.h"
 #include "acl.h"
-#include "xip.h"
 #include "xattr.h"
 
 static int __ext2_write_inode(struct inode *inode, int do_sync);
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 0a7697a..76d33b9 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -35,7 +35,6 @@
 #include "ext2.h"
 #include "xattr.h"
 #include "acl.h"
-#include "xip.h"
 
 static inline int ext2_add_nondir(struct dentry *dentry, struct inode *inode)
 {
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index caf3f62..fd62082 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -35,7 +35,6 @@
 #include "ext2.h"
 #include "xattr.h"
 #include "acl.h"
-#include "xip.h"
 
 static void ext2_sync_super(struct super_block *sb,
 			    struct ext2_super_block *es, int wait);
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
deleted file mode 100644
index 66ca113..0000000
--- a/fs/ext2/xip.c
+++ /dev/null
@@ -1,15 +0,0 @@
-/*
- *  linux/fs/ext2/xip.c
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte (cotte@de.ibm.com)
- */
-
-#include <linux/mm.h>
-#include <linux/fs.h>
-#include <linux/genhd.h>
-#include <linux/buffer_head.h>
-#include <linux/blkdev.h>
-#include "ext2.h"
-#include "xip.h"
-
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
deleted file mode 100644
index 87eeb04..0000000
--- a/fs/ext2/xip.h
+++ /dev/null
@@ -1,16 +0,0 @@
-/*
- *  linux/fs/ext2/xip.h
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte (cotte@de.ibm.com)
- */
-
-#ifdef CONFIG_EXT2_FS_XIP
-static inline int ext2_use_xip (struct super_block *sb)
-{
-	struct ext2_sb_info *sbi = EXT2_SB(sb);
-	return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
-}
-#else
-#define ext2_use_xip(sb)			0
-#endif
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 15/22] Remove CONFIG_EXT2_FS_XIP
  2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
                   ` (13 preceding siblings ...)
  2014-01-16  1:24 ` [PATCH v5 14/22] ext2: Remove xip.c and xip.h Matthew Wilcox
@ 2014-01-16  1:24 ` Matthew Wilcox
  2014-01-16  1:24 ` [PATCH v5 16/22] ext2: Remove ext2_aops_xip Matthew Wilcox
                   ` (13 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-16  1:24 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-ext4; +Cc: Matthew Wilcox

The fewer Kconfig options we have the better.  Use the generic CONFIG_FS_XIP
to enable XIP support in ext2 as well as in the core.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/Kconfig      | 21 ++++++++++++++-------
 fs/ext2/Kconfig | 11 -----------
 fs/ext2/file.c  |  4 ++--
 fs/ext2/super.c |  4 ++--
 4 files changed, 18 insertions(+), 22 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index c229f82..1e01dda 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -13,13 +13,6 @@ if BLOCK
 source "fs/ext2/Kconfig"
 source "fs/ext3/Kconfig"
 source "fs/ext4/Kconfig"
-
-config FS_XIP
-# execute in place
-	bool
-	depends on EXT2_FS_XIP
-	default y
-
 source "fs/jbd/Kconfig"
 source "fs/jbd2/Kconfig"
 
@@ -40,6 +33,20 @@ source "fs/ocfs2/Kconfig"
 source "fs/btrfs/Kconfig"
 source "fs/nilfs2/Kconfig"
 
+config FS_XIP
+	bool "Execute in place support"
+	depends on MMU
+	help
+	  Execute in place (XIP) can be used on memory-backed block devices.
+	  If the block device supports XIP and the filesystem supports XIP,
+	  then you can avoid using the pagecache to buffer I/Os.  Turning
+	  on this option will compile in support for XIP; you will need to
+	  mount the filesystem using the -o xip option.
+
+	  If you do not have a block device that is capable of using this,
+	  or if unsure, say N.  Saying Y will increase the size of the kernel
+	  by about 2kB.
+
 endif # BLOCK
 
 # Posix ACL utility routines
diff --git a/fs/ext2/Kconfig b/fs/ext2/Kconfig
index 14a6780..c634874e 100644
--- a/fs/ext2/Kconfig
+++ b/fs/ext2/Kconfig
@@ -42,14 +42,3 @@ config EXT2_FS_SECURITY
 
 	  If you are not using a security module that requires using
 	  extended attributes for file security labels, say N.
-
-config EXT2_FS_XIP
-	bool "Ext2 execute in place support"
-	depends on EXT2_FS && MMU
-	help
-	  Execute in place can be used on memory-backed block devices. If you
-	  enable this option, you can select to mount block devices which are
-	  capable of this feature without using the page cache.
-
-	  If you do not use a block device that is capable of using this,
-	  or if unsure, say N.
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 9e88388..8cf2c5f 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -25,7 +25,7 @@
 #include "xattr.h"
 #include "acl.h"
 
-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_XIP
 static int ext2_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	return xip_fault(vma, vmf, ext2_get_block);
@@ -109,7 +109,7 @@ const struct file_operations ext2_file_operations = {
 	.splice_write	= generic_file_splice_write,
 };
 
-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_XIP
 const struct file_operations ext2_xip_file_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= do_sync_read,
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index fd62082..a2caffd 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -287,7 +287,7 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root)
 		seq_puts(seq, ",grpquota");
 #endif
 
-#if defined(CONFIG_EXT2_FS_XIP)
+#ifdef CONFIG_FS_XIP
 	if (sbi->s_mount_opt & EXT2_MOUNT_XIP)
 		seq_puts(seq, ",xip");
 #endif
@@ -549,7 +549,7 @@ static int parse_options(char *options, struct super_block *sb)
 			break;
 #endif
 		case Opt_xip:
-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_XIP
 			set_opt (sbi->s_mount_opt, XIP);
 #else
 			ext2_msg(sb, KERN_INFO, "xip option not supported");
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 16/22] ext2: Remove ext2_aops_xip
  2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
                   ` (14 preceding siblings ...)
  2014-01-16  1:24 ` [PATCH v5 15/22] Remove CONFIG_EXT2_FS_XIP Matthew Wilcox
@ 2014-01-16  1:24 ` Matthew Wilcox
  2014-01-16  1:24 ` [PATCH v5 17/22] xip: Add xip_zero_page_range Matthew Wilcox
                   ` (12 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-16  1:24 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-ext4; +Cc: Matthew Wilcox

We shouldn't need a special address_space_operations any more

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext2/ext2.h  | 1 -
 fs/ext2/inode.c | 7 +------
 fs/ext2/namei.c | 4 ++--
 3 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index d9a17d0..f39599e 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -789,7 +789,6 @@ extern const struct file_operations ext2_xip_file_operations;
 
 /* inode.c */
 extern const struct address_space_operations ext2_aops;
-extern const struct address_space_operations ext2_aops_xip;
 extern const struct address_space_operations ext2_nobh_aops;
 
 /* namei.c */
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index af0fa94..9eadd78 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -890,11 +890,6 @@ const struct address_space_operations ext2_aops = {
 	.error_remove_page	= generic_error_remove_page,
 };
 
-const struct address_space_operations ext2_aops_xip = {
-	.bmap			= ext2_bmap,
-	.direct_IO		= ext2_direct_IO,
-};
-
 const struct address_space_operations ext2_nobh_aops = {
 	.readpage		= ext2_readpage,
 	.readpages		= ext2_readpages,
@@ -1393,7 +1388,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext2_file_inode_operations;
 		if (test_opt(inode->i_sb, XIP)) {
-			inode->i_mapping->a_ops = &ext2_aops_xip;
+			inode->i_mapping->a_ops = &ext2_aops;
 			inode->i_fop = &ext2_xip_file_operations;
 		} else if (test_opt(inode->i_sb, NOBH)) {
 			inode->i_mapping->a_ops = &ext2_nobh_aops;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 76d33b9..2fc20ca 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 
 	inode->i_op = &ext2_file_inode_operations;
 	if (test_opt(inode->i_sb, XIP)) {
-		inode->i_mapping->a_ops = &ext2_aops_xip;
+		inode->i_mapping->a_ops = &ext2_aops;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
@@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 
 	inode->i_op = &ext2_file_inode_operations;
 	if (test_opt(inode->i_sb, XIP)) {
-		inode->i_mapping->a_ops = &ext2_aops_xip;
+		inode->i_mapping->a_ops = &ext2_aops;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
 		inode->i_mapping->a_ops = &ext2_nobh_aops;
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 17/22] xip: Add xip_zero_page_range
  2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
                   ` (15 preceding siblings ...)
  2014-01-16  1:24 ` [PATCH v5 16/22] ext2: Remove ext2_aops_xip Matthew Wilcox
@ 2014-01-16  1:24 ` Matthew Wilcox
  2014-01-16  1:24 ` [PATCH v5 18/22] ext4: Make ext4_block_zero_page_range static Matthew Wilcox
                   ` (11 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-16  1:24 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-ext4
  Cc: Matthew Wilcox, Ross Zwisler

This new function allows us to support hole-punch for XIP files by zeroing
a partial page, as opposed to the xip_truncate_page() function which can
only truncate to the end of the page.  Reimplement xip_truncate_page() as
a macro that calls xip_zero_page_range().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
[ported to 3.13-rc2]
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 Documentation/filesystems/xip.txt |  1 +
 fs/xip.c                          | 15 +++++++++------
 include/linux/fs.h                | 11 +++++++++--
 3 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
index 520e73a..a8bccb6 100644
--- a/Documentation/filesystems/xip.txt
+++ b/Documentation/filesystems/xip.txt
@@ -55,6 +55,7 @@ Filesystem support consists of
   for fault and page_mkwrite (which should probably call xip_fault() and
   xip_mkwrite(), passing the appropriate get_block() callback)
 - calling xip_truncate_page() instead of block_truncate_page() for XIP files
+- calling xip_zero_page_range() instead of zero_user() for XIP files
 - ensuring that there is sufficient locking between reads, writes,
   truncates and page faults
 
diff --git a/fs/xip.c b/fs/xip.c
index 3f5f081..9087e0f 100644
--- a/fs/xip.c
+++ b/fs/xip.c
@@ -357,13 +357,16 @@ int xip_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 EXPORT_SYMBOL_GPL(xip_mkwrite);
 
 /**
- * xip_truncate_page - handle a partial page being truncated in an XIP file
+ * xip_zero_page_range - zero a range within a page of an XIP file
  * @inode: The file being truncated
  * @from: The file offset that is being truncated to
+ * @length: The number of bytes to zero
  * @get_block: The filesystem method used to translate file offsets to blocks
  *
- * Similar to block_truncate_page(), this function can be called by a
- * filesystem when it is truncating an XIP file to handle the partial page.
+ * This function can be called by a filesystem when it is zeroing part of a
+ * page in an XIP file.  This is intended for hole-punch operations.  If
+ * you are truncating a file, the helper function xip_truncate_page() may be
+ * more convenient.
  *
  * We work in terms of PAGE_CACHE_SIZE here for commonality with
  * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
@@ -371,12 +374,12 @@ EXPORT_SYMBOL_GPL(xip_mkwrite);
  * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
  * since the file might be mmaped.
  */
-int xip_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+int xip_zero_page_range(struct inode *inode, loff_t from, unsigned length,
+							get_block_t get_block)
 {
 	struct buffer_head bh;
 	pgoff_t index = from >> PAGE_CACHE_SHIFT;
 	unsigned offset = from & (PAGE_CACHE_SIZE-1);
-	unsigned length = PAGE_CACHE_ALIGN(from) - from;
 	int err;
 
 	/* Block boundary? Nothing to do */
@@ -398,4 +401,4 @@ int xip_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 
 	return 0;
 }
-EXPORT_SYMBOL_GPL(xip_truncate_page);
+EXPORT_SYMBOL_GPL(xip_zero_page_range);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c93671a..04338a3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2509,7 +2509,7 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_FS_XIP
 int xip_clear_blocks(struct inode *, sector_t block, long size);
-int xip_truncate_page(struct inode *, loff_t from, get_block_t);
+int xip_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 ssize_t xip_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
 		loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
 int xip_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
@@ -2520,7 +2520,8 @@ static inline int xip_clear_blocks(struct inode *i, sector_t blk, long sz)
 	return 0;
 }
 
-static inline int xip_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
+static inline int xip_zero_page_range(struct inode *inode, loff_t from,
+						unsigned len, get_block_t gb)
 {
 	return 0;
 }
@@ -2533,6 +2534,12 @@ static inline ssize_t xip_do_io(int rw, struct kiocb *iocb, struct inode *inode,
 }
 #endif
 
+/* Can't be a function because PAGE_CACHE_ALIGN is defined in pagemap.h */
+#define xip_truncate_page(inode, from, get_block)	\
+	xip_zero_page_range(inode, from, PAGE_CACHE_ALIGN(from) - from, \
+					get_block)
+
+
 #ifdef CONFIG_BLOCK
 typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
 			    loff_t file_offset);
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 18/22] ext4: Make ext4_block_zero_page_range static
  2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
                   ` (16 preceding siblings ...)
  2014-01-16  1:24 ` [PATCH v5 17/22] xip: Add xip_zero_page_range Matthew Wilcox
@ 2014-01-16  1:24 ` Matthew Wilcox
  2014-01-16  1:24 ` [PATCH v5 19/22] ext4: Add XIP functionality Matthew Wilcox
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-16  1:24 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-ext4; +Cc: Matthew Wilcox

It's only called within inode.c, so make it static, remove its prototype
from ext4.h and move it above all of its callers so it doesn't need a
prototype within inode.c.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext4/ext4.h  |  2 --
 fs/ext4/inode.c | 42 +++++++++++++++++++++---------------------
 2 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e618503..a48d367 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2121,8 +2121,6 @@ extern int ext4_writepage_trans_blocks(struct inode *);
 extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
 extern int ext4_block_truncate_page(handle_t *handle,
 		struct address_space *mapping, loff_t from);
-extern int ext4_block_zero_page_range(handle_t *handle,
-		struct address_space *mapping, loff_t from, loff_t length);
 extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
 			     loff_t lstart, loff_t lend);
 extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0757634..c767666 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3324,33 +3324,13 @@ void ext4_set_aops(struct inode *inode)
 }
 
 /*
- * ext4_block_truncate_page() zeroes out a mapping from file offset `from'
- * up to the end of the block which corresponds to `from'.
- * This required during truncate. We need to physically zero the tail end
- * of that block so it doesn't yield old data if the file is later grown.
- */
-int ext4_block_truncate_page(handle_t *handle,
-		struct address_space *mapping, loff_t from)
-{
-	unsigned offset = from & (PAGE_CACHE_SIZE-1);
-	unsigned length;
-	unsigned blocksize;
-	struct inode *inode = mapping->host;
-
-	blocksize = inode->i_sb->s_blocksize;
-	length = blocksize - (offset & (blocksize - 1));
-
-	return ext4_block_zero_page_range(handle, mapping, from, length);
-}
-
-/*
  * ext4_block_zero_page_range() zeros out a mapping of length 'length'
  * starting from file offset 'from'.  The range to be zero'd must
  * be contained with in one block.  If the specified range exceeds
  * the end of the block it will be shortened to end of the block
  * that cooresponds to 'from'
  */
-int ext4_block_zero_page_range(handle_t *handle,
+static int ext4_block_zero_page_range(handle_t *handle,
 		struct address_space *mapping, loff_t from, loff_t length)
 {
 	ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT;
@@ -3440,6 +3420,26 @@ unlock:
 	return err;
 }
 
+/*
+ * ext4_block_truncate_page() zeroes out a mapping from file offset `from'
+ * up to the end of the block which corresponds to `from'.
+ * This required during truncate. We need to physically zero the tail end
+ * of that block so it doesn't yield old data if the file is later grown.
+ */
+int ext4_block_truncate_page(handle_t *handle,
+		struct address_space *mapping, loff_t from)
+{
+	unsigned offset = from & (PAGE_CACHE_SIZE-1);
+	unsigned length;
+	unsigned blocksize;
+	struct inode *inode = mapping->host;
+
+	blocksize = inode->i_sb->s_blocksize;
+	length = blocksize - (offset & (blocksize - 1));
+
+	return ext4_block_zero_page_range(handle, mapping, from, length);
+}
+
 int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
 			     loff_t lstart, loff_t length)
 {
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 19/22] ext4: Add XIP functionality
  2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
                   ` (17 preceding siblings ...)
  2014-01-16  1:24 ` [PATCH v5 18/22] ext4: Make ext4_block_zero_page_range static Matthew Wilcox
@ 2014-01-16  1:24 ` Matthew Wilcox
  2014-01-16  1:24 ` [PATCH v5 20/22] ext4: Fix typos Matthew Wilcox
                   ` (9 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-16  1:24 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-ext4
  Cc: Ross Zwisler, Matthew Wilcox

From: Ross Zwisler <ross.zwisler@linux.intel.com>

This is a port of the XIP functionality found in the current version of
ext2.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
[heavily tweaked]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 Documentation/filesystems/ext4.txt |  2 ++
 Documentation/filesystems/xip.txt  |  1 +
 fs/ext4/ext4.h                     |  2 ++
 fs/ext4/file.c                     | 53 +++++++++++++++++++++++++++----
 fs/ext4/indirect.c                 | 19 +++++++----
 fs/ext4/inode.c                    | 64 +++++++++++++++++++++++++++-----------
 fs/ext4/namei.c                    | 10 ++++--
 fs/ext4/super.c                    | 39 ++++++++++++++++++++++-
 8 files changed, 157 insertions(+), 33 deletions(-)

diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 919a329..06dab5e 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -386,6 +386,8 @@ max_dir_size_kb=n	This limits the size of directories so that any
 i_version		Enable 64-bit inode version support. This option is
 			off by default.
 
+xip			Use execute in place (no caching) if possible
+
 Data Mode
 =========
 There are 3 different data modes:
diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
index a8bccb6..b158de6 100644
--- a/Documentation/filesystems/xip.txt
+++ b/Documentation/filesystems/xip.txt
@@ -67,6 +67,7 @@ work correctly.
 
 These filesystems may be used for inspiration:
 - ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+- ext4: the fourth extended filesystem, see Documentation/filesystems/ext4.txt
 
 
 Shortcomings
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index a48d367..5b160da 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -954,6 +954,7 @@ struct ext4_inode_info {
 #define EXT4_MOUNT_ERRORS_MASK		0x00070
 #define EXT4_MOUNT_MINIX_DF		0x00080	/* Mimics the Minix statfs */
 #define EXT4_MOUNT_NOLOAD		0x00100	/* Don't use existing journal*/
+#define EXT4_MOUNT_XIP			0x00200	/* Execute in place */
 #define EXT4_MOUNT_DATA_FLAGS		0x00C00	/* Mode for data writes: */
 #define EXT4_MOUNT_JOURNAL_DATA		0x00400	/* Write data to journal */
 #define EXT4_MOUNT_ORDERED_DATA		0x00800	/* Flush data before commit */
@@ -2569,6 +2570,7 @@ extern const struct file_operations ext4_dir_operations;
 /* file.c */
 extern const struct inode_operations ext4_file_inode_operations;
 extern const struct file_operations ext4_file_operations;
+extern const struct file_operations ext4_xip_file_operations;
 extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
 extern void ext4_unwritten_wait(struct inode *inode);
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 3da2194..fb18af0 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -190,7 +190,7 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
 		}
 	}
 
-	if (unlikely(iocb->ki_filp->f_flags & O_DIRECT))
+	if (io_is_direct(iocb->ki_filp))
 		ret = ext4_file_dio_write(iocb, iov, nr_segs, pos);
 	else
 		ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
@@ -198,6 +198,27 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
 	return ret;
 }
 
+#ifdef CONFIG_FS_XIP
+static int ext4_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return xip_fault(vma, vmf, ext4_get_block);
+					/* Is this the right get_block? */
+}
+
+static int ext4_xip_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return xip_mkwrite(vma, vmf, ext4_get_block);
+}
+
+static const struct vm_operations_struct ext4_xip_vm_ops = {
+	.fault		= ext4_xip_fault,
+	.page_mkwrite	= ext4_xip_mkwrite,
+	.remap_pages	= generic_file_remap_pages,
+};
+#else
+#define ext4_xip_vm_ops	ext4_file_vm_ops
+#endif
+
 static const struct vm_operations_struct ext4_file_vm_ops = {
 	.fault		= filemap_fault,
 	.page_mkwrite   = ext4_page_mkwrite,
@@ -206,12 +227,13 @@ static const struct vm_operations_struct ext4_file_vm_ops = {
 
 static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
-	struct address_space *mapping = file->f_mapping;
-
-	if (!mapping->a_ops->readpage)
-		return -ENOEXEC;
 	file_accessed(file);
-	vma->vm_ops = &ext4_file_vm_ops;
+	if (IS_XIP(file_inode(file))) {
+		vma->vm_ops = &ext4_xip_vm_ops;
+		vma->vm_flags |= VM_MIXEDMAP;
+	} else {
+		vma->vm_ops = &ext4_file_vm_ops;
+	}
 	return 0;
 }
 
@@ -609,6 +631,25 @@ const struct file_operations ext4_file_operations = {
 	.fallocate	= ext4_fallocate,
 };
 
+#ifdef CONFIG_FS_XIP
+const struct file_operations ext4_xip_file_operations = {
+	.llseek		= ext4_llseek,
+	.read		= do_sync_read,
+	.write		= do_sync_write,
+	.aio_read	= generic_file_aio_read,
+	.aio_write	= ext4_file_write,
+	.unlocked_ioctl = ext4_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= ext4_compat_ioctl,
+#endif
+	.mmap		= ext4_file_mmap,
+	.open		= ext4_file_open,
+	.release	= ext4_release_file,
+	.fsync		= ext4_sync_file,
+	.fallocate	= ext4_fallocate,
+};
+#endif
+
 const struct inode_operations ext4_file_inode_operations = {
 	.setattr	= ext4_setattr,
 	.getattr	= ext4_getattr,
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 594009f..ef2bdef 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -686,15 +686,22 @@ retry:
 			inode_dio_done(inode);
 			goto locked;
 		}
-		ret = __blockdev_direct_IO(rw, iocb, inode,
-				 inode->i_sb->s_bdev, iov,
-				 offset, nr_segs,
-				 ext4_get_block, NULL, NULL, 0);
+		if (IS_XIP(inode))
+			ret = xip_do_io(rw, iocb, inode, iov, offset, nr_segs,
+					ext4_get_block, NULL, 0);
+		else
+			ret = __blockdev_direct_IO(rw, iocb, inode,
+					inode->i_sb->s_bdev, iov, offset,
+					nr_segs, ext4_get_block, NULL, NULL, 0);
 		inode_dio_done(inode);
 	} else {
 locked:
-		ret = blockdev_direct_IO(rw, iocb, inode, iov,
-				 offset, nr_segs, ext4_get_block);
+		if (IS_XIP(inode))
+			ret = xip_do_io(rw, iocb, inode, iov, offset, nr_segs,
+					ext4_get_block, NULL, 0);
+		else
+			ret = blockdev_direct_IO(rw, iocb, inode, iov,
+					offset, nr_segs, ext4_get_block);
 
 		if (unlikely((rw & WRITE) && ret < 0)) {
 			loff_t isize = i_size_read(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c767666..8b73d77 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -663,6 +663,18 @@ found:
 			WARN_ON(1);
 		}
 
+		/* this is probably wrong for ext4.  unlike ext2, ext4 supports
+		 * uninitialised extents, so we should probably be hooking
+		 * into the "make it initialised" code instead. */
+		if (IS_XIP(inode)) {
+			ret = xip_clear_blocks(inode, map->m_pblk,
+						map->m_len << inode->i_blkbits);
+			if (ret) {
+				retval = ret;
+				goto has_zeroout;
+			}
+		}
+
 		/*
 		 * If the extent has been zeroed out, we don't need to update
 		 * extent status tree.
@@ -3152,13 +3164,14 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
 		get_block_func = ext4_get_block_write;
 		dio_flags = DIO_LOCKING;
 	}
-	ret = __blockdev_direct_IO(rw, iocb, inode,
-				   inode->i_sb->s_bdev, iov,
-				   offset, nr_segs,
-				   get_block_func,
-				   ext4_end_io_dio,
-				   NULL,
-				   dio_flags);
+	if (IS_XIP(inode))
+		ret = xip_do_io(rw, iocb, inode, iov, offset, nr_segs,
+				get_block_func, ext4_end_io_dio, dio_flags);
+	else
+		ret = __blockdev_direct_IO(rw, iocb, inode,
+					   inode->i_sb->s_bdev, iov, offset,
+					   nr_segs, get_block_func,
+					   ext4_end_io_dio, NULL, dio_flags);
 
 	/*
 	 * Put our reference to io_end. This can free the io_end structure e.g.
@@ -3323,14 +3336,7 @@ void ext4_set_aops(struct inode *inode)
 		inode->i_mapping->a_ops = &ext4_aops;
 }
 
-/*
- * ext4_block_zero_page_range() zeros out a mapping of length 'length'
- * starting from file offset 'from'.  The range to be zero'd must
- * be contained with in one block.  If the specified range exceeds
- * the end of the block it will be shortened to end of the block
- * that cooresponds to 'from'
- */
-static int ext4_block_zero_page_range(handle_t *handle,
+static int __ext4_block_zero_page_range(handle_t *handle,
 		struct address_space *mapping, loff_t from, loff_t length)
 {
 	ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT;
@@ -3421,6 +3427,22 @@ unlock:
 }
 
 /*
+ * ext4_block_zero_page_range() zeros out a mapping of length 'length'
+ * starting from file offset 'from'.  The range to be zero'd must
+ * be contained with in one block.  If the specified range exceeds
+ * the end of the block it will be shortened to end of the block
+ * that cooresponds to 'from'
+ */
+static int ext4_block_zero_page_range(handle_t *handle,
+		struct address_space *mapping, loff_t from, loff_t length)
+{
+	struct inode *inode = mapping->host;
+	if (IS_XIP(inode))
+		return xip_zero_page_range(inode, from, length, ext4_get_block);
+	return __ext4_block_zero_page_range(handle, mapping, from, length);
+}
+
+/*
  * ext4_block_truncate_page() zeroes out a mapping from file offset `from'
  * up to the end of the block which corresponds to `from'.
  * This required during truncate. We need to physically zero the tail end
@@ -3939,7 +3961,8 @@ void ext4_set_inode_flags(struct inode *inode)
 {
 	unsigned int flags = EXT4_I(inode)->i_flags;
 
-	inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+	inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME |
+				S_DIRSYNC | S_XIP);
 	if (flags & EXT4_SYNC_FL)
 		inode->i_flags |= S_SYNC;
 	if (flags & EXT4_APPEND_FL)
@@ -3950,6 +3973,8 @@ void ext4_set_inode_flags(struct inode *inode)
 		inode->i_flags |= S_NOATIME;
 	if (flags & EXT4_DIRSYNC_FL)
 		inode->i_flags |= S_DIRSYNC;
+	if (test_opt(inode->i_sb, XIP))
+		inode->i_flags |= S_XIP;
 }
 
 /* Propagate flags from i_flags to EXT4_I(inode)->i_flags */
@@ -4201,7 +4226,10 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		inode->i_fop = &ext4_file_operations;
+		if (test_opt(inode->i_sb, XIP))
+			inode->i_fop = &ext4_xip_file_operations;
+		else
+			inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 	} else if (S_ISDIR(inode->i_mode)) {
 		inode->i_op = &ext4_dir_inode_operations;
@@ -4653,7 +4681,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 		 * Truncate pagecache after we've waited for commit
 		 * in data=journal mode to make pages freeable.
 		 */
-			truncate_pagecache(inode, inode->i_size);
+		truncate_pagecache(inode, inode->i_size);
 	}
 	/*
 	 * We want to call ext4_truncate() even if attr->ia_size ==
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 5a0408d..ac68129 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2250,7 +2250,10 @@ retry:
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		inode->i_fop = &ext4_file_operations;
+		if (test_opt(inode->i_sb, XIP))
+			inode->i_fop = &ext4_xip_file_operations;
+		else
+			inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 		err = ext4_add_nondir(handle, dentry, inode);
 		if (!err && IS_DIRSYNC(dir))
@@ -2314,7 +2317,10 @@ retry:
 	err = PTR_ERR(inode);
 	if (!IS_ERR(inode)) {
 		inode->i_op = &ext4_file_inode_operations;
-		inode->i_fop = &ext4_file_operations;
+		if (test_opt(inode->i_sb, XIP))
+			inode->i_fop = &ext4_xip_file_operations;
+		else
+			inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 		d_tmpfile(dentry, inode);
 		err = ext4_orphan_add(handle, inode);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c977f4e..309a1a3 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1156,7 +1156,7 @@ enum {
 	Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
 	Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_jqfmt_vfsv1, Opt_quota,
 	Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
-	Opt_usrquota, Opt_grpquota, Opt_i_version,
+	Opt_usrquota, Opt_grpquota, Opt_i_version, Opt_xip,
 	Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
 	Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
 	Opt_inode_readahead_blks, Opt_journal_ioprio,
@@ -1218,6 +1218,7 @@ static const match_table_t tokens = {
 	{Opt_barrier, "barrier"},
 	{Opt_nobarrier, "nobarrier"},
 	{Opt_i_version, "i_version"},
+	{Opt_xip, "xip"},
 	{Opt_stripe, "stripe=%u"},
 	{Opt_delalloc, "delalloc"},
 	{Opt_nodelalloc, "nodelalloc"},
@@ -1400,6 +1401,7 @@ static const struct mount_opts {
 	{Opt_min_batch_time, 0, MOPT_GTE0},
 	{Opt_inode_readahead_blks, 0, MOPT_GTE0},
 	{Opt_init_itable, 0, MOPT_GTE0},
+	{Opt_xip, EXT4_MOUNT_XIP, MOPT_SET},
 	{Opt_stripe, 0, MOPT_GTE0},
 	{Opt_resuid, 0, MOPT_GTE0},
 	{Opt_resgid, 0, MOPT_GTE0},
@@ -1638,6 +1640,11 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 		}
 		sbi->s_jquota_fmt = m->mount_opt;
 #endif
+#ifndef CONFIG_FS_XIP
+	} else if (token == Opt_xip) {
+		ext4_msg(sb, KERN_INFO, "xip option not supported");
+		return -1;
+#endif
 	} else {
 		if (!args->from)
 			arg = 1;
@@ -3551,6 +3558,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 				 "both data=journal and dioread_nolock");
 			goto failed_mount;
 		}
+		if (test_opt(sb, XIP)) {
+			ext4_msg(sb, KERN_ERR, "can't mount with "
+				 "both data=journal and xip");
+			goto failed_mount;
+		}
 		if (test_opt(sb, DELALLOC))
 			clear_opt(sb, DELALLOC);
 	}
@@ -3604,6 +3616,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		goto failed_mount;
 	}
 
+	if (sbi->s_mount_opt & EXT4_MOUNT_XIP) {
+		if (blocksize != PAGE_SIZE) {
+			ext4_msg(sb, KERN_ERR,
+					"error: unsupported blocksize for xip");
+			goto failed_mount;
+		}
+		if (!sb->s_bdev->bd_disk->fops->direct_access) {
+			ext4_msg(sb, KERN_ERR,
+					"error: device does not support xip");
+			goto failed_mount;
+		}
+	}
+
 	if (sb->s_blocksize != blocksize) {
 		/* Validate the filesystem blocksize */
 		if (!sb_set_blocksize(sb, blocksize)) {
@@ -4798,6 +4823,18 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
 			err = -EINVAL;
 			goto restore_opts;
 		}
+		if (test_opt(sb, XIP)) {
+			ext4_msg(sb, KERN_ERR, "can't mount with "
+				 "both data=journal and xip");
+			err = -EINVAL;
+			goto restore_opts;
+		}
+	}
+
+	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT4_MOUNT_XIP) {
+		ext4_msg(sb, KERN_WARNING, "warning: refusing change of "
+			"xip flag with busy inodes while remounting");
+		sbi->s_mount_opt ^= EXT4_MOUNT_XIP;
 	}
 
 	if (sbi->s_mount_flags & EXT4_MF_FS_ABORTED)
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 20/22] ext4: Fix typos
  2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
                   ` (18 preceding siblings ...)
  2014-01-16  1:24 ` [PATCH v5 19/22] ext4: Add XIP functionality Matthew Wilcox
@ 2014-01-16  1:24 ` Matthew Wilcox
  2014-01-16  1:24 ` [PATCH v5 21/22] xip: Add reporting of major faults Matthew Wilcox
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-16  1:24 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-ext4; +Cc: Matthew Wilcox

Comment fix only

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/ext4/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 8b73d77..cae40af 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3720,7 +3720,7 @@ void ext4_truncate(struct inode *inode)
 
 	/*
 	 * There is a possibility that we're either freeing the inode
-	 * or it completely new indode. In those cases we might not
+	 * or it's a completely new inode. In those cases we might not
 	 * have i_mutex locked because it's not necessary.
 	 */
 	if (!(inode->i_state & (I_NEW|I_FREEING)))
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 21/22] xip: Add reporting of major faults
  2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
                   ` (19 preceding siblings ...)
  2014-01-16  1:24 ` [PATCH v5 20/22] ext4: Fix typos Matthew Wilcox
@ 2014-01-16  1:24 ` Matthew Wilcox
  2014-01-16  1:24 ` [PATCH v5 22/22] XIP: Add support for unwritten extents Matthew Wilcox
                   ` (7 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-16  1:24 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-ext4; +Cc: Matthew Wilcox

If we have to call get_block with the create argument set to 1, then
the filesystem almost certainly had to zero the block. which is an I/O,
which should be reported as a major fault.

Note that major faults on XIP files happen for different reasons than
major faults on non-XIP files.  XIP files behave as if everything except
file holes is already cached.  That's all the more reason to report
major faults when we do have to do I/O; it may be a valuable resource
for sysadmins trying to diagnose performance problems.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/xip.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/xip.c b/fs/xip.c
index 9087e0f..88a516b 100644
--- a/fs/xip.c
+++ b/fs/xip.c
@@ -19,10 +19,12 @@
 #include <linux/fs.h>
 #include <linux/genhd.h>
 #include <linux/highmem.h>
+#include <linux/memcontrol.h>
 #include <linux/mm.h>
 #include <linux/mutex.h>
 #include <linux/sched.h>
 #include <linux/uio.h>
+#include <linux/vmstat.h>
 
 int xip_clear_blocks(struct inode *inode, sector_t block, long size)
 {
@@ -250,6 +252,7 @@ static int do_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	pgoff_t size;
 	unsigned long pfn;
 	int error;
+	int major = 0;
 
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (vmf->pgoff >= size)
@@ -265,6 +268,9 @@ static int do_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (!buffer_mapped(&bh) && !vmf->cow_page) {
 		if (vmf->flags & FAULT_FLAG_WRITE) {
 			error = get_block(inode, block, &bh, 1);
+			count_vm_event(PGMAJFAULT);
+			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+			major = VM_FAULT_MAJOR;
 			if (error || bh.b_size < PAGE_SIZE)
 				return VM_FAULT_SIGBUS;
 		} else {
@@ -296,7 +302,7 @@ static int do_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	/* -EBUSY is fine, somebody else faulted on the same PTE */
 	if (error != -EBUSY)
 		BUG_ON(error);
-	return VM_FAULT_NOPAGE;
+	return VM_FAULT_NOPAGE | major;
 }
 
 /**
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 22/22] XIP: Add support for unwritten extents
  2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
                   ` (20 preceding siblings ...)
  2014-01-16  1:24 ` [PATCH v5 21/22] xip: Add reporting of major faults Matthew Wilcox
@ 2014-01-16  1:24 ` Matthew Wilcox
       [not found] ` <CEFDA737.22F87%matthew.r.wilcox@intel.com>
                   ` (6 subsequent siblings)
  28 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-16  1:24 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm, linux-ext4; +Cc: Matthew Wilcox

For read() and pagefault, we treat unwritten extents as holes.
For write(), we have to zero parts of the block that we're not going to
write to.  For holepunches, something's gone quite strangely wrong if we
get an unwritten extent from get_block, considering that the filesystem's
calling us to write zeroes to a partially written extent ...

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 Documentation/filesystems/xip.txt |  7 +++----
 fs/xip.c                          | 44 +++++++++++++++++++++++++++------------
 2 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
index b158de6..5e9a0c76 100644
--- a/Documentation/filesystems/xip.txt
+++ b/Documentation/filesystems/xip.txt
@@ -60,10 +60,9 @@ Filesystem support consists of
   truncates and page faults
 
 The get_block() callback passed to xip_do_io(), xip_fault(), xip_mkwrite()
-and xip_truncate_page() must not return uninitialised extents.  It must zero
-any blocks that it returns, and it must ensure that simultaneous calls to
-get_block() (for example by a page-fault racing with a read() or a write())
-work correctly.
+and xip_truncate_page() may return uninitialised extents.  If it does, it
+must ensure that simultaneous calls to get_block() (for example by a
+page-fault racing with a read() or a write()) work correctly.
 
 These filesystems may be used for inspiration:
 - ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
diff --git a/fs/xip.c b/fs/xip.c
index 88a516b..d160320 100644
--- a/fs/xip.c
+++ b/fs/xip.c
@@ -79,6 +79,12 @@ static long xip_get_pfn(struct inode *inode, struct buffer_head *bh,
 	return ops->direct_access(bdev, sector, &addr, pfn, bh->b_size);
 }
 
+/* true if a buffer_head represents written data */
+static bool buffer_written(struct buffer_head *bh)
+{
+	return buffer_mapped(bh) && !buffer_unwritten(bh);
+}
+
 static ssize_t xip_io(int rw, struct inode *inode, const struct iovec *iov,
 			loff_t start, loff_t end, unsigned nr_segs,
 			get_block_t get_block, struct buffer_head *bh)
@@ -103,21 +109,29 @@ static ssize_t xip_io(int rw, struct inode *inode, const struct iovec *iov,
 			retval = get_block(inode, block, bh, rw == WRITE);
 			if (retval)
 				break;
-			if (buffer_mapped(bh)) {
-				retval = xip_get_addr(inode, bh, &addr);
-				if (retval < 0)
-					break;
-				addr += offset - (block << inode->i_blkbits);
-				hole = false;
-				size = retval;
-			} else {
-				if (rw == WRITE) {
+			if (rw == WRITE) {
+				if (!buffer_mapped(bh)) {
 					retval = -EIO;
 					break;
 				}
+				hole = false;
+			} else {
+				hole = !buffer_written(bh);
+			}
+
+			if (hole) {
 				addr = NULL;
-				hole = true;
 				size = bh->b_size;
+			} else {
+				unsigned first;
+				retval = xip_get_addr(inode, bh, &addr);
+				if (retval < 0)
+					break;
+				size = retval;
+				first = offset - (block << inode->i_blkbits);
+				if (buffer_unwritten(bh))
+					memset(addr, 0, first);
+				addr += first;
 			}
 			max = offset + size;
 		}
@@ -265,7 +279,7 @@ static int do_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (error || bh.b_size < PAGE_SIZE)
 		return VM_FAULT_SIGBUS;
 
-	if (!buffer_mapped(&bh) && !vmf->cow_page) {
+	if (!buffer_written(&bh) && !vmf->cow_page) {
 		if (vmf->flags & FAULT_FLAG_WRITE) {
 			error = get_block(inode, block, &bh, 1);
 			count_vm_event(PGMAJFAULT);
@@ -286,7 +300,7 @@ static int do_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		return VM_FAULT_SIGBUS;
 	}
 	if (vmf->cow_page) {
-		if (buffer_mapped(&bh))
+		if (buffer_written(&bh))
 			copy_user_bh(vmf->cow_page, inode, &bh, vaddr);
 		else
 			clear_user_highpage(vmf->cow_page, vaddr);
@@ -397,7 +411,11 @@ int xip_zero_page_range(struct inode *inode, loff_t from, unsigned length,
 	err = get_block(inode, index, &bh, 0);
 	if (err < 0)
 		return err;
-	if (buffer_mapped(&bh)) {
+	if (buffer_written(&bh)) {
+		/*
+		 * Should this be BUG_ON(!buffer_mapped)?  Surely we should
+		 * never be called for an unmapped block ...
+		 */
 		void *addr;
 		err = xip_get_addr(inode, &bh, &addr);
 		if (err)
-- 
1.8.5.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 10/22] Remove get_xip_mem
  2014-01-16  1:24 ` [PATCH v5 10/22] Remove get_xip_mem Matthew Wilcox
@ 2014-01-16  1:46   ` Randy Dunlap
  2014-01-27 13:26     ` Matthew Wilcox
  0 siblings, 1 reply; 45+ messages in thread
From: Randy Dunlap @ 2014-01-16  1:46 UTC (permalink / raw)
  To: Matthew Wilcox, linux-kernel, linux-fsdevel, linux-mm, linux-ext4

On 01/15/2014 05:24 PM, Matthew Wilcox wrote:
> All callers of get_xip_mem() are now gone.  Remove checks for it,
> initialisers of it, documentation of it and the only implementation of it.
> 
> Add documentation for the new way of writing an XIP filesystem.
> 
> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
> ---
>  Documentation/filesystems/Locking |   3 -
>  Documentation/filesystems/xip.txt | 116 +++++++++++++++++++++-----------------
>  fs/exofs/inode.c                  |   1 -
>  fs/ext2/inode.c                   |   1 -
>  fs/ext2/xip.c                     |  37 ------------
>  fs/ext2/xip.h                     |   3 -
>  fs/open.c                         |   5 +-
>  include/linux/fs.h                |   2 -
>  mm/fadvise.c                      |   6 +-
>  mm/madvise.c                      |   2 +-
>  10 files changed, 70 insertions(+), 106 deletions(-)
> 
> diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
> index b62eabf..520e73a 100644
> --- a/Documentation/filesystems/xip.txt
> +++ b/Documentation/filesystems/xip.txt
> @@ -3,69 +3,81 @@ Execute-in-place for file mappings
>  
>  Motivation
>  ----------
> -File mappings are performed by mapping page cache pages to userspace. In
> -addition, read&write type file operations also transfer data from/to the page
> -cache.
> -
> -For memory backed storage devices that use the block device interface, the page
> -cache pages are in fact copies of the original storage. Various approaches
> -exist to work around the need for an extra copy. The ramdisk driver for example
> -does read the data into the page cache, keeps a reference, and discards the
> -original data behind later on.
> -
> -Execute-in-place solves this issue the other way around: instead of keeping
> -data in the page cache, the need to have a page cache copy is eliminated
> -completely. With execute-in-place, read&write type operations are performed
> -directly from/to the memory backed storage device. For file mappings, the
> -storage device itself is mapped directly into userspace.
> -
> -This implementation was initially written for shared memory segments between
> -different virtual machines on s390 hardware to allow multiple machines to
> -share the same binaries and libraries.
> -
> -Implementation
> ---------------
> -Execute-in-place is implemented in three steps: block device operation,
> -address space operation, and file operations.
> -
> -A block device operation named direct_access is used to translate the
> -block device sector number to a page frame number (pfn) that identifies
> -the physical page for the memory.  It also returns a kernel virtual
> -address that can be used to access the memory.
> +
> +File mappings are usually performed by mapping page cache pages to
> +userspace.  In addition, read & write file operations also transfer data
> +between the page cache and storage.
> +
> +For memory backed storage devices that use the block device interface,
> +the page cache pages are just copies of the original storage.  The
> +execute-in-place code removes the extra copy by performing reads and
> +writes directly on the memory backed storage device.  For file mappings,
> +the storage device itself is mapped directly into userspace.
> +
> +
> +Implementation Tips for Block Driver Writers
> +--------------------------------------------
> +
> +To support XIP in your block driver, implement the 'direct_access'
> +block device operation.  It is used to translate the sector number
> +(expressed in units of 512-byte sectors) to a page frame number (pfn)
> +that identifies the physical page for the memory.  It also returns a
> +kernel virtual address that can be used to access the memory.
>  
>  The direct_access method takes a 'size' parameter that indicates the
>  number of bytes being requested.  The function should return the number
>  of bytes that it can provide, although it must not exceed the number of
>  bytes requested.  It may also return a negative errno if an error occurs.
>  
> -The block device operation is optional, these block devices support it as of
> -today:
> +In order to support this method, the storage must be byte-accessable by

                                                        byte-accessible

> +the CPU at all times.  If your device uses paging techniques to expose
> +a large amount of memory through a smaller window, then you cannot
> +implement direct_access.  Equally, if your device can occasionally
> +stall the CPU for an extended period, you should also not attempt to
> +implement direct_access.
> +
> +These block devices may be used for inspiration:
> +- axonram: Axon DDR2 device driver
> +- brd: RAM backed block device driver
>  - dcssblk: s390 dcss block device driver
>  
> -An address space operation named get_xip_mem is used to retrieve references
> -to a page frame number and a kernel address. To obtain these values a reference
> -to an address_space is provided. This function assigns values to the kmem and
> -pfn parameters. The third argument indicates whether the function should allocate
> -blocks if needed.
>  
> -This address space operation is mutually exclusive with readpage&writepage that
> -do page cache read/write operations.
> -The following filesystems support it as of today:
> -- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
> +Implementation Tips for Filesystem Writers
> +------------------------------------------
> +
> +Filesystem support consists of
> +- adding support to mark inodes as being XIP by setting the S_XIP flag in
> +  i_flags
> +- implementing the direct_IO address space operation, and calling
> +  xip_do_io() instead of blockdev_direct_IO() if S_XIP is set
> +- implementing an mmap file operation for XIP files which sets the
> +  VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include handlers
> +  for fault and page_mkwrite (which should probably call xip_fault() and
> +  xip_mkwrite(), passing the appropriate get_block() callback)
> +- calling xip_truncate_page() instead of block_truncate_page() for XIP files
> +- ensuring that there is sufficient locking between reads, writes,
> +  truncates and page faults

     truncates, and
but that's up to you and your editor/proofreader etc.  :)

> +
> +The get_block() callback passed to xip_do_io(), xip_fault(), xip_mkwrite()
> +and xip_truncate_page() must not return uninitialised extents.  It must zero
> +any blocks that it returns, and it must ensure that simultaneous calls to
> +get_block() (for example by a page-fault racing with a read() or a write())
> +work correctly.
>  
> -A set of file operations that do utilize get_xip_page can be found in
> -mm/filemap_xip.c . The following file operation implementations are provided:
> -- aio_read/aio_write
> -- readv/writev
> -- sendfile
> +These filesystems may be used for inspiration:
> +- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
>  
> -The generic file operations do_sync_read/do_sync_write can be used to implement
> -classic synchronous IO calls.
>  
>  Shortcomings
>  ------------
> -This implementation is limited to storage devices that are cpu addressable at
> -all times (no highmem or such). It works well on rom/ram, but enhancements are
> -needed to make it work with flash in read+write mode.
> -Putting the Linux kernel and/or its modules on a xip filesystem does not mean
> -they are not copied.
> +
> +Even if the kernel or its modules are stored on an filesystem that supports

                                                   a

> +XIP on a block device that supports XIP, they will still be copied into RAM.
> +
> +Calling get_user_pages() on a range of user memory that has been mmaped
> +from an XIP file will fail as there are no 'struct page' to describe
> +those pages.  This problem is being worked on.  That means that O_DIRECT
> +reads/writes to those memory ranges from a non-XIP file will fail (note
> +that O_DIRECT reads/writes _of an XIP file_ do work, it is the memory
> +that is being accessed that is key here).  Other things that will not
> +work include RDMA, sendfile() and splice().

                      sendfile(),
same comment as above.



-- 
~Randy

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 19/22] ext4: Add XIP functionality
       [not found] ` <CEFDA737.22F87%matthew.r.wilcox@intel.com>
@ 2014-01-17  0:00   ` Ross Zwisler
  0 siblings, 0 replies; 45+ messages in thread
From: Ross Zwisler @ 2014-01-17  0:00 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, linux-fsdevel, linux-mm, linux-ext4, Ross Zwisler

On Wed, 15 Jan 2014, Matthew Wilcox wrote:

> +#ifdef CONFIG_FS_XIP
> +const struct file_operations ext4_xip_file_operations = {
> +	.llseek		= ext4_llseek,
> +	.read		= do_sync_read,
> +	.write		= do_sync_write,

I think we may always need to define ext2_xip_file_operations and
ext4_xip_file_operations, even if we have XIP compiled out.  We make the
decision on which file operations table to use at runtime:

from ext4_iget:
		if (test_opt(inode->i_sb, XIP))
                        inode->i_fop = &ext4_xip_file_operations;
                else    
                        inode->i_fop = &ext4_file_operations;

With CONFIG_FS_XIP undefined, we get a compile error:
	ERROR: "ext4_xip_file_operations" [fs/ext4/ext4.ko] undefined!
	ERROR: "ext2_xip_file_operations" [fs/ext2/ext2.ko] undefined!

My guess is that with the old ext2 XIP code and with the first pass of the ext4
XIP code, we weren't seeing this because the uses of the xip file operations
table were optimized out, removing the undefined symbol?

- Ross

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 22/22] XIP: Add support for unwritten extents
       [not found] ` <CEFD7DAD.22F65%matthew.r.wilcox@intel.com>
@ 2014-01-22 22:51   ` Ross Zwisler
  2014-01-23 12:08     ` Matthew Wilcox
       [not found]     ` <CF0C370C.235F1%willy@linux.intel.com>
  0 siblings, 2 replies; 45+ messages in thread
From: Ross Zwisler @ 2014-01-22 22:51 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel, linux-fsdevel, linux-mm, linux-ext4

On Wed, 15 Jan 2014, Matthew Wilcox wrote:

>  static ssize_t xip_io(int rw, struct inode *inode, const struct iovec
> *iov,
>  			loff_t start, loff_t end, unsigned nr_segs,
>  			get_block_t get_block, struct buffer_head *bh)
> @@ -103,21 +109,29 @@ static ssize_t xip_io(int rw, struct inode *inode,
> const struct iovec *iov,
>  			retval = get_block(inode, block, bh, rw == WRITE);
>  			if (retval)
>  				break;
> -			if (buffer_mapped(bh)) {
> -				retval = xip_get_addr(inode, bh, &addr);
> -				if (retval < 0)
> -					break;
> -				addr += offset - (block << inode->i_blkbits);
> -				hole = false;
> -				size = retval;
> -			} else {
> -				if (rw == WRITE) {
> +			if (rw == WRITE) {
> +				if (!buffer_mapped(bh)) {
>  					retval = -EIO;
>  					break;
>  				}
> +				hole = false;
> +			} else {
> +				hole = !buffer_written(bh);
> +			}
> +
> +			if (hole) {
>  				addr = NULL;
> -				hole = true;
>  				size = bh->b_size;
> +			} else {
> +				unsigned first;
> +				retval = xip_get_addr(inode, bh, &addr);
> +				if (retval < 0)
> +					break;
> +				size = retval;
> +				first = offset - (block << inode->i_blkbits);
> +				if (buffer_unwritten(bh))
> +					memset(addr, 0, first);
> +				addr += first;

+                               size -= first;

This is needed so that we don't overrun the XIP buffer we are given in the
event that our user buffer >= our XIP buffer and the start of our I/O isn't
block aligned.

You can add my 
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> 

Thanks,
- Ross

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4
  2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
                   ` (23 preceding siblings ...)
       [not found] ` <CEFD7DAD.22F65%matthew.r.wilcox@intel.com>
@ 2014-01-23  7:48 ` Dave Chinner
  2014-01-23  7:53   ` Dave Chinner
  2014-01-23  9:01 ` Dave Chinner
                   ` (3 subsequent siblings)
  28 siblings, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2014-01-23  7:48 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel, linux-fsdevel, linux-mm, linux-ext4

On Wed, Jan 15, 2014 at 08:24:18PM -0500, Matthew Wilcox wrote:
> This series of patches add support for XIP to ext4.  Unfortunately,
> it turns out to be necessary to rewrite the existing XIP support code
> first due to races that are unfixable in the current design.
> 
> Since v4 of this patchset, I've improved the documentation, fixed a
> couple of warnings that a newer version of gcc emitted, and fixed a
> bug where we would read/write the wrong address for I/Os that were not
> aligned to PAGE_SIZE.
> 
> I've dropped the PMD fault patch from this set since there are some
> places where we would need to split a PMD page and there's no way to do
> that right now.  In its place, I've added a patch which attempts to add
> support for unwritten extents.  I'm still in two minds about this; on the
> one hand, it's clearly a win for reads and writes.  On the other hand,
> it adds a lot of complexity, and it probably isn't a win for pagefaults.

FYI, this may just be pure coincidence, but shortly after the first
boot of a machine with this patchset on 3.13 the root *ext3*
filesystem started having problems.  It now gives persistent ENOSPC
errors when there's 2.3GB of space free (on a 8GB partition), even
though e2fsck says the filesystem is clean and error free.

Fmeh.

Update: I've just removed the patchset, rebuilt the kernel and the
ENOSPC problem is still there. So it may be co-incidence, but given
that it is persistent something is screwed got screwed up in the
filesytem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4
  2014-01-23  7:48 ` [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Dave Chinner
@ 2014-01-23  7:53   ` Dave Chinner
  0 siblings, 0 replies; 45+ messages in thread
From: Dave Chinner @ 2014-01-23  7:53 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel, linux-fsdevel, linux-mm, linux-ext4

On Thu, Jan 23, 2014 at 06:48:25PM +1100, Dave Chinner wrote:
> On Wed, Jan 15, 2014 at 08:24:18PM -0500, Matthew Wilcox wrote:
> > This series of patches add support for XIP to ext4.  Unfortunately,
> > it turns out to be necessary to rewrite the existing XIP support code
> > first due to races that are unfixable in the current design.
> > 
> > Since v4 of this patchset, I've improved the documentation, fixed a
> > couple of warnings that a newer version of gcc emitted, and fixed a
> > bug where we would read/write the wrong address for I/Os that were not
> > aligned to PAGE_SIZE.
> > 
> > I've dropped the PMD fault patch from this set since there are some
> > places where we would need to split a PMD page and there's no way to do
> > that right now.  In its place, I've added a patch which attempts to add
> > support for unwritten extents.  I'm still in two minds about this; on the
> > one hand, it's clearly a win for reads and writes.  On the other hand,
> > it adds a lot of complexity, and it probably isn't a win for pagefaults.
> 
> FYI, this may just be pure coincidence, but shortly after the first
> boot of a machine with this patchset on 3.13 the root *ext3*
> filesystem started having problems.  It now gives persistent ENOSPC
> errors when there's 2.3GB of space free (on a 8GB partition), even
> though e2fsck says the filesystem is clean and error free.
> 
> Fmeh.
> 
> Update: I've just removed the patchset, rebuilt the kernel and the
> ENOSPC problem is still there. So it may be co-incidence, but given
> that it is persistent something is screwed got screwed up in the
> filesytem.

OK, false alarm - it is co-incidence. The damn root filesystem ran
out of inodes. Can you beleive that you're only allowed 600k inodes
in an 8GB filesystems? Sheesh! :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4
  2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
                   ` (24 preceding siblings ...)
  2014-01-23  7:48 ` [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Dave Chinner
@ 2014-01-23  9:01 ` Dave Chinner
  2014-01-23 12:12   ` Wilcox, Matthew R
  2014-01-30  6:42 ` Dave Chinner
                   ` (2 subsequent siblings)
  28 siblings, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2014-01-23  9:01 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel, linux-fsdevel, linux-mm, linux-ext4

On Wed, Jan 15, 2014 at 08:24:18PM -0500, Matthew Wilcox wrote:
> This series of patches add support for XIP to ext4.  Unfortunately,
> it turns out to be necessary to rewrite the existing XIP support code
> first due to races that are unfixable in the current design.
> 
> Since v4 of this patchset, I've improved the documentation, fixed a
> couple of warnings that a newer version of gcc emitted, and fixed a
> bug where we would read/write the wrong address for I/Os that were not
> aligned to PAGE_SIZE.
> 
> I've dropped the PMD fault patch from this set since there are some
> places where we would need to split a PMD page and there's no way to do
> that right now.  In its place, I've added a patch which attempts to add
> support for unwritten extents.  I'm still in two minds about this; on the
> one hand, it's clearly a win for reads and writes.  On the other hand,
> it adds a lot of complexity, and it probably isn't a win for pagefaults.

I ran this through xfstests, but ext4 in default configuration fails
too many of the tests with filesystem corruption and other cascading
failures on the quick group tests (generic/013, generic/070,
generic/075, generic/091, etc)  for me to be able to tell if adding
MOUNT_OPTIONS="-o xip" adds any problems or not....

XIP definitely caused generic/001 to fail, but other than that I
can't really tell. Still, it looks like it functions enough to be
able to add XFS support on top of. I'll get back to you with that ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 22/22] XIP: Add support for unwritten extents
  2014-01-22 22:51   ` [PATCH v5 22/22] XIP: Add support for unwritten extents Ross Zwisler
@ 2014-01-23 12:08     ` Matthew Wilcox
  2014-01-23 19:13       ` Ross Zwisler
       [not found]     ` <CF0C370C.235F1%willy@linux.intel.com>
  1 sibling, 1 reply; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-23 12:08 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Matthew Wilcox, linux-kernel, linux-fsdevel, linux-mm, linux-ext4

On Wed, Jan 22, 2014 at 03:51:56PM -0700, Ross Zwisler wrote:
> > +			if (hole) {
> >  				addr = NULL;
> > -				hole = true;
> >  				size = bh->b_size;
> > +			} else {
> > +				unsigned first;
> > +				retval = xip_get_addr(inode, bh, &addr);
> > +				if (retval < 0)
> > +					break;
> > +				size = retval;
> > +				first = offset - (block << inode->i_blkbits);
> > +				if (buffer_unwritten(bh))
> > +					memset(addr, 0, first);
> > +				addr += first;
> 
> +                               size -= first;
> 
> This is needed so that we don't overrun the XIP buffer we are given in the
> event that our user buffer >= our XIP buffer and the start of our I/O isn't
> block aligned.

You're right!  Thank you!  However, we also need it for the hole ==
true case, don't we?  So maybe something like this, incrementally on top of
patch 22/22:

P.S. Can someone come up with a better name for this variable than 'first'?
I'd usually use 'offset', but that's already taken.  'annoying_bit' seems a
bit judgemental.  'misaligned', maybe?  'skip' or 'seek' like dd uses?

diff --git a/fs/xip.c b/fs/xip.c
index 92157ff..1ae00db 100644
--- a/fs/xip.c
+++ b/fs/xip.c
@@ -103,6 +103,7 @@ static ssize_t xip_io(int rw, struct inode *inode, const struct iovec *iov,
 
 		if (max == offset) {
 			sector_t block = offset >> inode->i_blkbits;
+			unsigned first = offset - (block << inode->i_blkbits);
 			long size;
 			memset(bh, 0, sizeof(*bh));
 			bh->b_size = ALIGN(end - offset, PAGE_SIZE);
@@ -121,14 +122,12 @@ static ssize_t xip_io(int rw, struct inode *inode, const struct iovec *iov,
 
 			if (hole) {
 				addr = NULL;
-				size = bh->b_size;
+				size = bh->b_size - first;
 			} else {
-				unsigned first;
 				retval = xip_get_addr(inode, bh, &addr);
 				if (retval < 0)
 					break;
-				size = retval;
-				first = offset - (block << inode->i_blkbits);
+				size = retval - first;
 				if (buffer_unwritten(bh))
 					memset(addr, 0, first);
 				addr += first;

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* RE: [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4
  2014-01-23  9:01 ` Dave Chinner
@ 2014-01-23 12:12   ` Wilcox, Matthew R
  2014-01-28  6:06     ` Dave Chinner
  0 siblings, 1 reply; 45+ messages in thread
From: Wilcox, Matthew R @ 2014-01-23 12:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, linux-ext4

Are you hitting the same problems with ext4 fsck that we did?  Version 1.42.8 reports spurious corruption.  From the 1.42.9 changelog:

  * Fixed a regression introduced in 1.42.8 which would cause e2fsck to
    erroneously report uninitialized extents past i_size to be invalid.

________________________________________
From: Dave Chinner [david@fromorbit.com]
Sent: January 23, 2014 1:01 AM
To: Wilcox, Matthew R
Cc: linux-kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-mm@kvack.org; linux-ext4@vger.kernel.org
Subject: Re: [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4

On Wed, Jan 15, 2014 at 08:24:18PM -0500, Matthew Wilcox wrote:
> This series of patches add support for XIP to ext4.  Unfortunately,
> it turns out to be necessary to rewrite the existing XIP support code
> first due to races that are unfixable in the current design.
>
> Since v4 of this patchset, I've improved the documentation, fixed a
> couple of warnings that a newer version of gcc emitted, and fixed a
> bug where we would read/write the wrong address for I/Os that were not
> aligned to PAGE_SIZE.
>
> I've dropped the PMD fault patch from this set since there are some
> places where we would need to split a PMD page and there's no way to do
> that right now.  In its place, I've added a patch which attempts to add
> support for unwritten extents.  I'm still in two minds about this; on the
> one hand, it's clearly a win for reads and writes.  On the other hand,
> it adds a lot of complexity, and it probably isn't a win for pagefaults.

I ran this through xfstests, but ext4 in default configuration fails
too many of the tests with filesystem corruption and other cascading
failures on the quick group tests (generic/013, generic/070,
generic/075, generic/091, etc)  for me to be able to tell if adding
MOUNT_OPTIONS="-o xip" adds any problems or not....

XIP definitely caused generic/001 to fail, but other than that I
can't really tell. Still, it looks like it functions enough to be
able to add XFS support on top of. I'll get back to you with that ;)

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 22/22] XIP: Add support for unwritten extents
  2014-01-23 12:08     ` Matthew Wilcox
@ 2014-01-23 19:13       ` Ross Zwisler
  0 siblings, 0 replies; 45+ messages in thread
From: Ross Zwisler @ 2014-01-23 19:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ross Zwisler, Matthew Wilcox, linux-kernel, linux-fsdevel,
	linux-mm, linux-ext4

On Thu, 23 Jan 2014, Matthew Wilcox wrote:
> On Wed, Jan 22, 2014 at 03:51:56PM -0700, Ross Zwisler wrote:
> > > +			if (hole) {
> > >  				addr = NULL;
> > > -				hole = true;
> > >  				size = bh->b_size;
> > > +			} else {
> > > +				unsigned first;
> > > +				retval = xip_get_addr(inode, bh, &addr);
> > > +				if (retval < 0)
> > > +					break;
> > > +				size = retval;
> > > +				first = offset - (block << inode->i_blkbits);
> > > +				if (buffer_unwritten(bh))
> > > +					memset(addr, 0, first);
> > > +				addr += first;
> > 
> > +                               size -= first;
> > 
> > This is needed so that we don't overrun the XIP buffer we are given in the
> > event that our user buffer >= our XIP buffer and the start of our I/O isn't
> > block aligned.
> 
> You're right!  Thank you!  However, we also need it for the hole ==
> true case, don't we?  So maybe something like this, incrementally on top of
> patch 22/22:
> 
> P.S. Can someone come up with a better name for this variable than 'first'?
> I'd usually use 'offset', but that's already taken.  'annoying_bit' seems a
> bit judgemental.  'misaligned', maybe?  'skip' or 'seek' like dd uses?
> 
> diff --git a/fs/xip.c b/fs/xip.c
> index 92157ff..1ae00db 100644
> --- a/fs/xip.c
> +++ b/fs/xip.c
> @@ -103,6 +103,7 @@ static ssize_t xip_io(int rw, struct inode *inode, const struct iovec *iov,
>  
>  		if (max == offset) {
>  			sector_t block = offset >> inode->i_blkbits;
> +			unsigned first = offset - (block << inode->i_blkbits);
>  			long size;
>  			memset(bh, 0, sizeof(*bh));
>  			bh->b_size = ALIGN(end - offset, PAGE_SIZE);
> @@ -121,14 +122,12 @@ static ssize_t xip_io(int rw, struct inode *inode, const struct iovec *iov,
>  
>  			if (hole) {
>  				addr = NULL;
> -				size = bh->b_size;
> +				size = bh->b_size - first;
>  			} else {
> -				unsigned first;
>  				retval = xip_get_addr(inode, bh, &addr);
>  				if (retval < 0)
>  					break;
> -				size = retval;
> -				first = offset - (block << inode->i_blkbits);
> +				size = retval - first;
>  				if (buffer_unwritten(bh))
>  					memset(addr, 0, first);
>  				addr += first;

Yep, this seems right to me.

Maybe "misalignment"?  Seems more descriptive (if a bit long), but I don't
know if there are other, better existing conventions.

- Ross

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 10/22] Remove get_xip_mem
  2014-01-16  1:46   ` Randy Dunlap
@ 2014-01-27 13:26     ` Matthew Wilcox
  0 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-27 13:26 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Matthew Wilcox, linux-kernel, linux-fsdevel, linux-mm, linux-ext4

On Wed, Jan 15, 2014 at 05:46:28PM -0800, Randy Dunlap wrote:
> > +In order to support this method, the storage must be byte-accessable by
> 
>                                                         byte-accessible

Thanks.  Fixed.

> > +- ensuring that there is sufficient locking between reads, writes,
> > +  truncates and page faults
> 
>      truncates, and
> but that's up to you and your editor/proofreader etc.  :)

Ooh, do we really get to have a discussion about the Oxford Comma on
linux-kernel?  :-)

I haven't actually run this material past my editor (who is my wife, so
I need really convincing arguments to do it your way instead of hers),
but funnily we had a conversation about the Oxford Comma while on holiday
last week.  http://en.wikipedia.org/wiki/Serial_comma has a reasonably
exhaustive discourse on the subject, and I learned that she's probably
primarily against it because of her journalism degree.

> > +Even if the kernel or its modules are stored on an filesystem that supports
> 
>                                                    a

Good catch.  I think I started out with 'an fs', then expanded it to
"an filesystem" which of course is nonsense.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 22/22] XIP: Add support for unwritten extents
       [not found]     ` <CF0C370C.235F1%willy@linux.intel.com>
@ 2014-01-27 23:32       ` Ross Zwisler
  2014-01-28  3:49         ` Matthew Wilcox
  0 siblings, 1 reply; 45+ messages in thread
From: Ross Zwisler @ 2014-01-27 23:32 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ross Zwisler, Matthew Wilcox, linux-kernel, linux-fsdevel,
	linux-mm, linux-ext4

On Thu, 23 Jan 2014, Matthew Wilcox wrote:
> On Wed, Jan 22, 2014 at 03:51:56PM -0700, Ross Zwisler wrote:
> > > +			if (hole) {
> > >  				addr = NULL;
> > > -				hole = true;
> > >  				size = bh->b_size;
> > > +			} else {
> > > +				unsigned first;
> > > +				retval = xip_get_addr(inode, bh, &addr);
> > > +				if (retval < 0)
> > > +					break;
> > > +				size = retval;
> > > +				first = offset - (block << inode->i_blkbits);
> > > +				if (buffer_unwritten(bh))
> > > +					memset(addr, 0, first);
> > > +				addr += first;
> > 
> > +                               size -= first;
> > 
> > This is needed so that we don't overrun the XIP buffer we are given in the
> > event that our user buffer >= our XIP buffer and the start of our I/O isn't
> > block aligned.
> 
> You're right!  Thank you!  However, we also need it for the hole ==
> true case, don't we?  So maybe something like this, incrementally on top of
> patch 22/22:
> 
> P.S. Can someone come up with a better name for this variable than 'first'?
> I'd usually use 'offset', but that's already taken.  'annoying_bit' seems a
> bit judgemental.  'misaligned', maybe?  'skip' or 'seek' like dd uses?
> 
> diff --git a/fs/xip.c b/fs/xip.c
> index 92157ff..1ae00db 100644
> --- a/fs/xip.c
> +++ b/fs/xip.c
> @@ -103,6 +103,7 @@ static ssize_t xip_io(int rw, struct inode *inode, const
> struct iovec *iov,
>  
>  		if (max == offset) {
>  			sector_t block = offset >> inode->i_blkbits;
> +			unsigned first = offset - (block << inode->i_blkbits);
>  			long size;
>  			memset(bh, 0, sizeof(*bh));
>  			bh->b_size = ALIGN(end - offset, PAGE_SIZE);
> @@ -121,14 +122,12 @@ static ssize_t xip_io(int rw, struct inode *inode,
> const struct iovec *iov,
>  
>  			if (hole) {
>  				addr = NULL;
> -				size = bh->b_size;
> +				size = bh->b_size - first;

It looks like we have an additional bit of complexity with the hole case.  The
issue is that for holes, bh->b_size is just the full size of the write as set
earlier in the function:

                        bh->b_size = ALIGN(end - offset, PAGE_SIZE);

>From this code it seems like you hoped the call into get_block() would adjust
bh->b_size to the size of the hole, allowing you to zero just the hole space
in the user buffer.  It doesn't look like it does, though, at least for ext4.
In looking at the direct I/O case (do_direct_IO()), they deal with holes on a
per FS block basis, and don't ever look at bh->b_size once they've figured out
the buffer is unmapped.

The result of this is that when you get a read that starts at a hole but moves
into real data, the read will just see a hole and return data of all zeros.

To just assume the current FS block is a hole, we can do something like this:

diff --git a/fs/xip.c b/fs/xip.c
index 35e401e..e902593 100644
--- a/fs/xip.c
+++ b/fs/xip.c
@@ -122,7 +122,7 @@ static ssize_t xip_io(int rw, struct inode *inode, const struct
 
                        if (hole) {
                                addr = NULL;
-                               size = bh->b_size - first;
+                               size = (1 << inode->i_blkbits) - first;
                        } else {
                                retval = xip_get_addr(inode, bh, &addr);
                                if (retval < 0)


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 22/22] XIP: Add support for unwritten extents
  2014-01-27 23:32       ` Ross Zwisler
@ 2014-01-28  3:49         ` Matthew Wilcox
  0 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2014-01-28  3:49 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Matthew Wilcox, Matthew Wilcox, linux-kernel, linux-fsdevel,
	linux-mm, linux-ext4

On Mon, Jan 27, 2014 at 04:32:07PM -0700, Ross Zwisler wrote:
> It looks like we have an additional bit of complexity with the hole case.  The
> issue is that for holes, bh->b_size is just the full size of the write as set
> earlier in the function:
> 
>                         bh->b_size = ALIGN(end - offset, PAGE_SIZE);
> 
> >From this code it seems like you hoped the call into get_block() would adjust
> bh->b_size to the size of the hole, allowing you to zero just the hole space
> in the user buffer.  It doesn't look like it does, though, at least for ext4.

Argh.  I got confused.  ext4 *has* this information, it just doesn't
propagate it into the bh if it's a hole!  Should it?  The comments in
the direct IO code imply that it *may*, but doesn't *have* to.  What it's
doing now (not touching it) is probably wrong.

> To just assume the current FS block is a hole, we can do something like this:

Yes, this should fix things on an interim basis.  Bit inefficient,
but it'll work.

> diff --git a/fs/xip.c b/fs/xip.c
> index 35e401e..e902593 100644
> --- a/fs/xip.c
> +++ b/fs/xip.c
> @@ -122,7 +122,7 @@ static ssize_t xip_io(int rw, struct inode *inode, const struct
>  
>                         if (hole) {
>                                 addr = NULL;
> -                               size = bh->b_size - first;
> +                               size = (1 << inode->i_blkbits) - first;
>                         } else {
>                                 retval = xip_get_addr(inode, bh, &addr);
>                                 if (retval < 0)
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4
  2014-01-23 12:12   ` Wilcox, Matthew R
@ 2014-01-28  6:06     ` Dave Chinner
  0 siblings, 0 replies; 45+ messages in thread
From: Dave Chinner @ 2014-01-28  6:06 UTC (permalink / raw)
  To: Wilcox, Matthew R; +Cc: linux-kernel, linux-fsdevel, linux-mm, linux-ext4

On Thu, Jan 23, 2014 at 12:12:43PM +0000, Wilcox, Matthew R wrote:
> Are you hitting the same problems with ext4 fsck that we did?  Version 1.42.8 reports spurious corruption.  From the 1.42.9 changelog:
> 
>   * Fixed a regression introduced in 1.42.8 which would cause e2fsck to
>     erroneously report uninitialized extents past i_size to be invalid.

Don't think so - I'm running 1.42.9 on my test VM.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4
  2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
                   ` (25 preceding siblings ...)
  2014-01-23  9:01 ` Dave Chinner
@ 2014-01-30  6:42 ` Dave Chinner
  2014-01-30  9:25   ` Dave Chinner
       [not found] ` <CF1FF3EB.24114%matthew.r.wilcox@intel.com>
       [not found] ` <CF215477.24281%matthew.r.wilcox@intel.com>
  28 siblings, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2014-01-30  6:42 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel, linux-fsdevel, linux-mm, linux-ext4

On Wed, Jan 15, 2014 at 08:24:18PM -0500, Matthew Wilcox wrote:
> This series of patches add support for XIP to ext4.  Unfortunately,
> it turns out to be necessary to rewrite the existing XIP support code
> first due to races that are unfixable in the current design.
> 
> Since v4 of this patchset, I've improved the documentation, fixed a
> couple of warnings that a newer version of gcc emitted, and fixed a
> bug where we would read/write the wrong address for I/Os that were not
> aligned to PAGE_SIZE.

Looks like there's something fundamentally broken with the patch set
as it stands. I get this same data corruption on both ext4 and XFS
with XIP using fsx. It's as basic as it gets - the first read after
a mmapped write fails to see the data written by mmap:

$ sudo mkfs.xfs -f /dev/ram0
meta-data=/dev/ram0              isize=256    agcount=4, agsize=256000 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=0
data     =                       bsize=4096   blocks=1024000, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal log           bsize=4096   blocks=12800, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$ sudo mount -o xip /dev/ram0 /mnt/scr
$ sudo chmod 777 /mnt/scr
$ ltp/fsx -d -N 1000 -S 0 /mnt/scr/fsx
Seed set to 3774
1 mapwrite      0x3db39 thru    0x3ffff (0x24c7 bytes)
2 mapread       0x2e947 thru    0x33163 (0x481d bytes)
3 read  0x2e836 thru    0x3cba1 (0xe36c bytes)
4 punch from 0x2e7 to 0x5c43, (0x595c bytes)
5 mapwrite      0xcaea thru     0x13ba9 (0x70c0 bytes)
6 punch from 0x31645 to 0x38d1d, (0x76d8 bytes)
7 falloc        from 0x24f92 to 0x2f2b7 (0xa325 bytes)
fallocating to largest ever: 0x171ac
8 falloc        from 0xbcf1 to 0x171ac (0xb4bb bytes)
9 read  0x126f thru     0x11136 (0xfec8 bytes)
READ BAD DATA: offset = 0x126f, size = 0xfec8, fname = /mnt/scr/fsx
OFFSET  GOOD    BAD     RANGE
0x caea 0x05f9  0x0000  0x    0
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x caeb 0xf905  0x0000  0x    1
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x caec 0x0599  0x0000  0x    2
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x caed 0x9905  0x0000  0x    3
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x caee 0x05e8  0x0000  0x    4
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x caef 0xe805  0x0000  0x    5
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x caf0 0x0580  0x0000  0x    6
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x caf1 0x8005  0x0000  0x    7
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x caf2 0x056c  0x0000  0x    8
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x caf3 0x6c05  0x0000  0x    9
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x caf4 0x05ad  0x0000  0x    a
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x caf5 0xad05  0x0000  0x    b
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x caf6 0x0539  0x0000  0x    c
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x caf7 0x3905  0x0000  0x    d
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x caf8 0x05db  0x0000  0x    e
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x caf9 0xdb05  0x0000  0x    f
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
LOG DUMP (9 total operations):
1(  1 mod 256): MAPWRITE 0x3db39 thru 0x3ffff   (0x24c7 bytes)
2(  2 mod 256): MAPREAD  0x2e947 thru 0x33163   (0x481d bytes)
3(  3 mod 256): READ     0x2e836 thru 0x3cba1   (0xe36c bytes)
4(  4 mod 256): PUNCH    0x2e7 thru 0x5c42      (0x595c bytes)
5(  5 mod 256): MAPWRITE 0xcaea thru 0x13ba9    (0x70c0 bytes)  ******WWWW
6(  6 mod 256): PUNCH    0x31645 thru 0x38d1c   (0x76d8 bytes)
7(  7 mod 256): FALLOC   0x24f92 thru 0x2f2b7   (0xa325 bytes) INTERIOR
8(  8 mod 256): FALLOC   0xbcf1 thru 0x171ac    (0xb4bb bytes) INTERIOR ******FFFF
9(  9 mod 256): READ     0x126f thru 0x11136    (0xfec8 bytes)  ***RRRR***
Correct content saved for comparison
(maybe hexdump "/mnt/scr/fsx" vs "/mnt/scr/fsx.fsxgood")

XFS gives a good indication that we aren't doing something correctly
w.r.t. mapped XIP writes, as trying to fiemap the file ASSERT fails
with a delayed allocation extent somewhere inside the file after a
sync. I shall keep digging.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4
  2014-01-30  6:42 ` Dave Chinner
@ 2014-01-30  9:25   ` Dave Chinner
  2014-01-31  3:06     ` Dave Chinner
  0 siblings, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2014-01-30  9:25 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel, linux-fsdevel, linux-mm, linux-ext4

On Thu, Jan 30, 2014 at 05:42:30PM +1100, Dave Chinner wrote:
> On Wed, Jan 15, 2014 at 08:24:18PM -0500, Matthew Wilcox wrote:
> > This series of patches add support for XIP to ext4.  Unfortunately,
> > it turns out to be necessary to rewrite the existing XIP support code
> > first due to races that are unfixable in the current design.
> > 
> > Since v4 of this patchset, I've improved the documentation, fixed a
> > couple of warnings that a newer version of gcc emitted, and fixed a
> > bug where we would read/write the wrong address for I/Os that were not
> > aligned to PAGE_SIZE.
> 
> Looks like there's something fundamentally broken with the patch set
> as it stands. I get this same data corruption on both ext4 and XFS
> with XIP using fsx. It's as basic as it gets - the first read after
> a mmapped write fails to see the data written by mmap:
> 
> $ sudo mkfs.xfs -f /dev/ram0
> meta-data=/dev/ram0              isize=256    agcount=4, agsize=256000 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=0
> data     =                       bsize=4096   blocks=1024000, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
> log      =internal log           bsize=4096   blocks=12800, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> $ sudo mount -o xip /dev/ram0 /mnt/scr
> $ sudo chmod 777 /mnt/scr
> $ ltp/fsx -d -N 1000 -S 0 /mnt/scr/fsx
....
> operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
> LOG DUMP (9 total operations):
> 1(  1 mod 256): MAPWRITE 0x3db39 thru 0x3ffff   (0x24c7 bytes)
> 2(  2 mod 256): MAPREAD  0x2e947 thru 0x33163   (0x481d bytes)
> 3(  3 mod 256): READ     0x2e836 thru 0x3cba1   (0xe36c bytes)
> 4(  4 mod 256): PUNCH    0x2e7 thru 0x5c42      (0x595c bytes)
> 5(  5 mod 256): MAPWRITE 0xcaea thru 0x13ba9    (0x70c0 bytes)  ******WWWW
> 6(  6 mod 256): PUNCH    0x31645 thru 0x38d1c   (0x76d8 bytes)
> 7(  7 mod 256): FALLOC   0x24f92 thru 0x2f2b7   (0xa325 bytes) INTERIOR
> 8(  8 mod 256): FALLOC   0xbcf1 thru 0x171ac    (0xb4bb bytes) INTERIOR ******FFFF
> 9(  9 mod 256): READ     0x126f thru 0x11136    (0xfec8 bytes)  ***RRRR***
> Correct content saved for comparison
> (maybe hexdump "/mnt/scr/fsx" vs "/mnt/scr/fsx.fsxgood")
> 
> XFS gives a good indication that we aren't doing something correctly
> w.r.t. mapped XIP writes, as trying to fiemap the file ASSERT fails
> with a delayed allocation extent somewhere inside the file after a
> sync. I shall keep digging.

Ok, I understand the XFS ASSERT failure, but I don't really
understand the reason for the read failure. XFS assert failed
because I was using the delayed allocation enabled xfs_get_blocks()
to xip_fault/xip_mkwrite, so it was creating a delalloc extent
rather than allocating blocks, and then not having any pages in the
page cache to write back to convert the delalloc extent. This
doesn't explain the zeros being read, though.

So I changed to use the direct IO version, and that leaves me with
an unwritten extent over the mapped write code. Why? Because there's
no IO completion being run from either xip_fault() or xip_mkwrite()
to zero the buffers and run IO completion to convert the extent to
written....

$ xfs_io -f -c "truncate 8k" -c "mmap 0 8k" -c "mwrite 0 4k" \
> -c "bmap -vp" -c "pread -v 0 8k" -c "bmap -vp" /mnt/scr/foo
....
/mnt/scr/foo:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..7]:          224..231          0 (224..231)           8 10000
   1: [8..15]:         hole                                     8
$

We're trying to do something that the get_block callback has never
supported.  I note that you added zeroing to ext4_map_blocks() when
an unwritten extent is found and call xip_clear_blocks() from there
to try and handle this within the allocation context without
actually making it obvious why it is necessary.

Essentially what we need get_blocks(create = 1) to do here is this:

	if (hole)
		transactionally allocate and zero block in requested region
	if (unwritten)
		transactionally convert to written and zero block
	if (written)
		map blocks

I think we can get away with this from a crash recovery perspective
because the zeroing of the blocks is synchronous and within the
allocation transaction. I'm implementing a new xfs_get_blocks_xip to
do keep this new behaviour "separate" from the direct IO path
semantics.

I also got rid of the read block map followed by the "create" block
map. Just a single call with create set appropriately for the caller
context is all that is required - the getblock call will do the
correct thing for allocation/conversion cases and if there's already
a block there it will just return the mapping....

<hack, hack>

OK, I've fixed something. The above xfs_io test returns the correct
data on read now, fsx still fails. I'll keep working on it in the
morning, and when I have something that works I'll post it....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4
  2014-01-30  9:25   ` Dave Chinner
@ 2014-01-31  3:06     ` Dave Chinner
  2014-01-31  5:45       ` Ross Zwisler
  0 siblings, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2014-01-31  3:06 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel, linux-fsdevel, linux-mm, linux-ext4

On Thu, Jan 30, 2014 at 08:25:37PM +1100, Dave Chinner wrote:
> On Thu, Jan 30, 2014 at 05:42:30PM +1100, Dave Chinner wrote:
> > On Wed, Jan 15, 2014 at 08:24:18PM -0500, Matthew Wilcox wrote:
> > > This series of patches add support for XIP to ext4.  Unfortunately,
> > > it turns out to be necessary to rewrite the existing XIP support code
> > > first due to races that are unfixable in the current design.
> > > 
> > > Since v4 of this patchset, I've improved the documentation, fixed a
> > > couple of warnings that a newer version of gcc emitted, and fixed a
> > > bug where we would read/write the wrong address for I/Os that were not
> > > aligned to PAGE_SIZE.
> > 
> > Looks like there's something fundamentally broken with the patch set
> > as it stands. I get this same data corruption on both ext4 and XFS
> > with XIP using fsx. It's as basic as it gets - the first read after
> > a mmapped write fails to see the data written by mmap:
> > 
> > $ sudo mkfs.xfs -f /dev/ram0
> > meta-data=/dev/ram0              isize=256    agcount=4, agsize=256000 blks
> >          =                       sectsz=512   attr=2, projid32bit=1
> >          =                       crc=0
> > data     =                       bsize=4096   blocks=1024000, imaxpct=25
> >          =                       sunit=0      swidth=0 blks
> > naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
> > log      =internal log           bsize=4096   blocks=12800, version=2
> >          =                       sectsz=512   sunit=0 blks, lazy-count=1
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> > $ sudo mount -o xip /dev/ram0 /mnt/scr
> > $ sudo chmod 777 /mnt/scr
> > $ ltp/fsx -d -N 1000 -S 0 /mnt/scr/fsx
> ....
> > operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
> > LOG DUMP (9 total operations):
> > 1(  1 mod 256): MAPWRITE 0x3db39 thru 0x3ffff   (0x24c7 bytes)
> > 2(  2 mod 256): MAPREAD  0x2e947 thru 0x33163   (0x481d bytes)
> > 3(  3 mod 256): READ     0x2e836 thru 0x3cba1   (0xe36c bytes)
> > 4(  4 mod 256): PUNCH    0x2e7 thru 0x5c42      (0x595c bytes)
> > 5(  5 mod 256): MAPWRITE 0xcaea thru 0x13ba9    (0x70c0 bytes)  ******WWWW
> > 6(  6 mod 256): PUNCH    0x31645 thru 0x38d1c   (0x76d8 bytes)
> > 7(  7 mod 256): FALLOC   0x24f92 thru 0x2f2b7   (0xa325 bytes) INTERIOR
> > 8(  8 mod 256): FALLOC   0xbcf1 thru 0x171ac    (0xb4bb bytes) INTERIOR ******FFFF
> > 9(  9 mod 256): READ     0x126f thru 0x11136    (0xfec8 bytes)  ***RRRR***
> > Correct content saved for comparison
> > (maybe hexdump "/mnt/scr/fsx" vs "/mnt/scr/fsx.fsxgood")
> > 
> > XFS gives a good indication that we aren't doing something correctly
> > w.r.t. mapped XIP writes, as trying to fiemap the file ASSERT fails
> > with a delayed allocation extent somewhere inside the file after a
> > sync. I shall keep digging.
> 
> Ok, I understand the XFS ASSERT failure, but I don't really
> understand the reason for the read failure. XFS assert failed
> because I was using the delayed allocation enabled xfs_get_blocks()
> to xip_fault/xip_mkwrite, so it was creating a delalloc extent
> rather than allocating blocks, and then not having any pages in the
> page cache to write back to convert the delalloc extent. This
> doesn't explain the zeros being read, though.
> 
> So I changed to use the direct IO version, and that leaves me with
> an unwritten extent over the mapped write code. Why? Because there's
> no IO completion being run from either xip_fault() or xip_mkwrite()
> to zero the buffers and run IO completion to convert the extent to
> written....
> 
> $ xfs_io -f -c "truncate 8k" -c "mmap 0 8k" -c "mwrite 0 4k" \
> > -c "bmap -vp" -c "pread -v 0 8k" -c "bmap -vp" /mnt/scr/foo
> ....
> /mnt/scr/foo:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>    0: [0..7]:          224..231          0 (224..231)           8 10000
>    1: [8..15]:         hole                                     8
> $
> 
> We're trying to do something that the get_block callback has never
> supported.  I note that you added zeroing to ext4_map_blocks() when
> an unwritten extent is found and call xip_clear_blocks() from there
> to try and handle this within the allocation context without
> actually making it obvious why it is necessary.
> 
> Essentially what we need get_blocks(create = 1) to do here is this:
> 
> 	if (hole)
> 		transactionally allocate and zero block in requested region
> 	if (unwritten)
> 		transactionally convert to written and zero block
> 	if (written)
> 		map blocks
> 
> I think we can get away with this from a crash recovery perspective
> because the zeroing of the blocks is synchronous and within the
> allocation transaction. I'm implementing a new xfs_get_blocks_xip to
> do keep this new behaviour "separate" from the direct IO path
> semantics.
> 
> I also got rid of the read block map followed by the "create" block
> map. Just a single call with create set appropriately for the caller
> context is all that is required - the getblock call will do the
> correct thing for allocation/conversion cases and if there's already
> a block there it will just return the mapping....
> 
> <hack, hack>
> 
> OK, I've fixed something. The above xfs_io test returns the correct
> data on read now, fsx still fails. I'll keep working on it in the
> morning, and when I have something that works I'll post it....

The read/write path is broken, Willy. We can't map arbitrary byte
ranges to the DIO subsystem. I'm now certain that the data
corruptions I'm seeing are in sub-sector regions from unaligned IOs
from userspace. We still need to use the buffered IO path for non
O_DIRECT IO to avoid these problems. I think I've worked out a way
to short-circuit page cache lookups for the buffered IO path, so
stay tuned....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4
  2014-01-31  3:06     ` Dave Chinner
@ 2014-01-31  5:45       ` Ross Zwisler
  2014-01-31 13:04         ` Dave Chinner
  0 siblings, 1 reply; 45+ messages in thread
From: Ross Zwisler @ 2014-01-31  5:45 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, linux-kernel, linux-fsdevel, linux-mm, linux-ext4

On Fri, 31 Jan 2014, Dave Chinner wrote:
> The read/write path is broken, Willy. We can't map arbitrary byte
> ranges to the DIO subsystem. I'm now certain that the data
> corruptions I'm seeing are in sub-sector regions from unaligned IOs
> from userspace. We still need to use the buffered IO path for non
> O_DIRECT IO to avoid these problems. I think I've worked out a way
> to short-circuit page cache lookups for the buffered IO path, so
> stay tuned....

Hi Dave,

I found an issue that would cause reads to return bad data earlier this week,
and sent a response to "[PATCH v5 22/22] XIP: Add support for unwritten
extents".  Just wanted to make sure you're not running into that issue.  

I'm also currently chasing a write corruption where we lose the data that we
had just written because ext4 thinks the portion of the extent we had just
written needs to be converted from an unwritten extent to a written extent, so
it clears the data to all zeros via:

	xip_clear_blocks+0x53/0xd7
	ext4_map_blocks+0x306/0x3d9 [ext4]
	jbd2__journal_start+0xbd/0x188 [jbd2]
	ext4_convert_unwritten_extents+0xf9/0x1ac [ext4]
	ext4_direct_IO+0x2ca/0x3a5 [ext4]

This bug can be easily reproduced by fallocating an empty file up to a page,
and then writing into that page.  The first write is essentially lost, and the
page remains all zeros.  Subsequent writes succeed.

I'm still in the process of figuring out exactly why this is happening, but
unfortunately I won't be able to look at again until next week.  I don't know
if it's related to the corruption that you're seeing or not, just wanted to
let you know.

- Ross

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4
  2014-01-31  5:45       ` Ross Zwisler
@ 2014-01-31 13:04         ` Dave Chinner
  0 siblings, 0 replies; 45+ messages in thread
From: Dave Chinner @ 2014-01-31 13:04 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Matthew Wilcox, linux-kernel, linux-fsdevel, linux-mm, linux-ext4

On Thu, Jan 30, 2014 at 10:45:26PM -0700, Ross Zwisler wrote:
> On Fri, 31 Jan 2014, Dave Chinner wrote:
> > The read/write path is broken, Willy. We can't map arbitrary byte
> > ranges to the DIO subsystem. I'm now certain that the data
> > corruptions I'm seeing are in sub-sector regions from unaligned IOs
> > from userspace. We still need to use the buffered IO path for non
> > O_DIRECT IO to avoid these problems. I think I've worked out a way
> > to short-circuit page cache lookups for the buffered IO path, so
> > stay tuned....
> 
> Hi Dave,
> 
> I found an issue that would cause reads to return bad data earlier this week,
> and sent a response to "[PATCH v5 22/22] XIP: Add support for unwritten
> extents".  Just wanted to make sure you're not running into that issue.  

After having a couple of good strong bourbons, It came to me about
15 minutes ago that I've had a case of forest, trees and bears
today. What I said above is most likely wrong and I think there's
probably a simple reason for the bug I'm seeing: xip_io() does
not handle the buffer_new(bh) case correctly. i.e. this case:


	  newly allocated buffer
	+-------------------------------+
	+-zero-+----user data----+-zero-+

i.e. it doesn't zero the partial head and tail of the block
correctly if the block has just been allocated. If the block has
just been allocated, get_block() is supposed to return the bh marked
as buffer_new() to indicate to the caller it needs to have regions
that aren't covered with data zeroed.

Yes, makes fsx run for more than 36 operations. But it only makes it
to 79 operations, and then....

# xfs_io -tf -c "truncate 18k" -c "pwrite 19k 21k" -c "pread -v 18k 2k" -c "bmap -vp" /mnt/scr/foo
....
EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..31]:         hole                                    32
   1: [32..39]:        120..127          0 (120..127)           8 10000
   2: [40..71]:        128..159          0 (128..159)          32 00000
   3: [72..79]:        216..223          0 (216..223)           8 00000

Ok, it's leaving an unwritten extent where it shouldn't be. That
explains the zeros instead of data. I'll look further at it in the
morning now I think I'm on the right track...

> I'm also currently chasing a write corruption where we lose the data that we
> had just written because ext4 thinks the portion of the extent we had just
> written needs to be converted from an unwritten extent to a written extent, so
> it clears the data to all zeros via:
> 
> 	xip_clear_blocks+0x53/0xd7
> 	ext4_map_blocks+0x306/0x3d9 [ext4]
> 	jbd2__journal_start+0xbd/0x188 [jbd2]
> 	ext4_convert_unwritten_extents+0xf9/0x1ac [ext4]
> 	ext4_direct_IO+0x2ca/0x3a5 [ext4]
> 
> This bug can be easily reproduced by fallocating an empty file up to a page,
> and then writing into that page.  The first write is essentially lost, and the
> page remains all zeros.  Subsequent writes succeed.

I don't think that's what I'm seeing. As it is, ext4 should not
be zeroing the block during unwritten extent conversion. This looks
to have been added
to make the xip_fault() path work, but I think it's wrong. What I
had to do for XFS in the xip_fault() path to ensure that we returned
allocated, zeroed blocks from xfs_get_blocks_xip() was:

	allocate as unwritten
	xip_clear_blocks
	mark extent as written

so that it behaves like the direct IO path in terms of the
"allocate, write data, mark unwritten" process. But for direct IO,
xip_io() needs to be doing the block zeroing in response to
get_blocks setting buffer_new(bh) after allocating the unwritten
extent. Then it can copy in the data and call endio to convert it to
written....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 06/22] Treat XIP like O_DIRECT
  2014-01-16  1:24 ` [PATCH v5 06/22] Treat XIP like O_DIRECT Matthew Wilcox
@ 2014-01-31 16:59   ` Jan Kara
  0 siblings, 0 replies; 45+ messages in thread
From: Jan Kara @ 2014-01-31 16:59 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel, linux-fsdevel, linux-mm, linux-ext4

On Wed 15-01-14 20:24:24, Matthew Wilcox wrote:
> Instead of separate read and write methods, use the generic AIO
> infrastructure.  In addition to giving us support for AIO, this adds
> the locking between read() and truncate() that was missing.
> 
> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
> ---
>  fs/Makefile        |   1 +
>  fs/ext2/file.c     |   6 +-
>  fs/ext2/inode.c    |   7 +-
>  fs/xip.c           | 156 +++++++++++++++++++++++++++++++++++
>  include/linux/fs.h |  18 ++++-
>  mm/filemap.c       |   6 +-
>  mm/filemap_xip.c   | 234 -----------------------------------------------------
>  7 files changed, 183 insertions(+), 245 deletions(-)
>  create mode 100644 fs/xip.c
> 
...
> +static ssize_t xip_io(int rw, struct inode *inode, const struct iovec *iov,
> +			loff_t start, loff_t end, unsigned nr_segs,
> +			get_block_t get_block, struct buffer_head *bh)
> +{
> +	ssize_t retval = 0;
> +	unsigned seg = 0;
> +	unsigned len;
> +	unsigned copied = 0;
> +	loff_t offset = start;
> +	loff_t max = start;
> +	void *addr;
> +	bool hole = false;
> +
> +	while (offset < end) {
> +		void __user *buf = iov[seg].iov_base + copied;
> +
> +		if (max == offset) {
> +			sector_t block = offset >> inode->i_blkbits;
> +			long size;
> +			memset(bh, 0, sizeof(*bh));
> +			bh->b_size = ALIGN(end - offset, PAGE_SIZE);
> +			retval = get_block(inode, block, bh, rw == WRITE);
> +			if (retval)
> +				break;
> +			if (buffer_mapped(bh)) {
> +				retval = xip_get_addr(inode, bh, &addr);
> +				if (retval < 0)
> +					break;
> +				addr += offset - (block << inode->i_blkbits);
> +				hole = false;
> +				size = retval;
  I think you are missing here zeroing out partial pages at the head and
the tail of the mapped extent when buffer_new(bh). Dave Chinner found this
in his testing as well I think.

								Honza

> +			} else {
> +				if (rw == WRITE) {
> +					retval = -EIO;
> +					break;
> +				}
> +				addr = NULL;
> +				hole = true;
> +				size = bh->b_size;
> +			}
> +			max = offset + size;
> +		}
> +
> +		len = min_t(unsigned, iov[seg].iov_len - copied, max - offset);
> +
> +		if (rw == WRITE)
> +			len -= __copy_from_user_nocache(addr, buf, len);
> +		else if (!hole)
> +			len -= __copy_to_user(buf, addr, len);
> +		else
> +			len -= __clear_user(buf, len);
> +
> +		if (!len)
> +			break;
> +
> +		offset += len;
> +		copied += len;
> +		if (copied == iov[seg].iov_len) {
> +			seg++;
> +			copied = 0;
> +		}
> +	}
> +
> +	return (offset == start) ? retval : offset - start;
> +}
> +
> +/**
> + * xip_do_io - Perform I/O to an XIP file
> + * @rw: READ to read or WRITE to write
> + * @iocb: The control block for this I/O
> + * @inode: The file which the I/O is directed at
> + * @iov: The user addresses to do I/O from or to
> + * @offset: The file offset where the I/O starts
> + * @nr_segs: The length of the iov array
> + * @get_block: The filesystem method used to translate file offsets to blocks
> + * @end_io: A filesystem callback for I/O completion
> + * @flags: See below
> + *
> + * This function uses the same locking scheme as do_blockdev_direct_IO:
> + * If @flags has DIO_LOCKING set, we assume that the i_mutex is held by the
> + * caller for writes.  For reads, we take and release the i_mutex ourselves.
> + * If DIO_LOCKING is not set, the filesystem takes care of its own locking.
> + * As with do_blockdev_direct_IO(), we increment i_dio_count while the I/O
> + * is in progress.
> + */
> +ssize_t xip_do_io(int rw, struct kiocb *iocb, struct inode *inode,
> +		const struct iovec *iov, loff_t offset, unsigned nr_segs,
> +		get_block_t get_block, dio_iodone_t end_io, int flags)
> +{
> +	struct buffer_head bh;
> +	unsigned seg;
> +	ssize_t retval = -EINVAL;
> +	loff_t end = offset;
> +
> +	for (seg = 0; seg < nr_segs; seg++)
> +		end += iov[seg].iov_len;
> +
> +	if ((flags & DIO_LOCKING) && (rw == READ)) {
> +		struct address_space *mapping = inode->i_mapping;
> +		mutex_lock(&inode->i_mutex);
> +		retval = filemap_write_and_wait_range(mapping, offset, end - 1);
> +		if (retval) {
> +			mutex_unlock(&inode->i_mutex);
> +			goto out;
> +		}
> +	}
> +
> +	/* Protects against truncate */
> +	atomic_inc(&inode->i_dio_count);
> +
> +	retval = xip_io(rw, inode, iov, offset, end, nr_segs, get_block, &bh);
> +
> +	if ((flags & DIO_LOCKING) && (rw == READ))
> +		mutex_unlock(&inode->i_mutex);
> +
> +	inode_dio_done(inode);
> +
> +	if ((retval > 0) && end_io)
> +		end_io(iocb, offset, retval, bh.b_private);
> + out:
> +	return retval;
> +}
> +EXPORT_SYMBOL_GPL(xip_do_io);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 80cfb42..7cc5bf7 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2509,17 +2509,22 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
>  extern int nonseekable_open(struct inode * inode, struct file * filp);
>  
>  #ifdef CONFIG_FS_XIP
> -extern ssize_t xip_file_read(struct file *filp, char __user *buf, size_t len,
> -			     loff_t *ppos);
>  extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
> -extern ssize_t xip_file_write(struct file *filp, const char __user *buf,
> -			      size_t len, loff_t *ppos);
>  extern int xip_truncate_page(struct address_space *mapping, loff_t from);
> +ssize_t xip_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
> +		loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
>  #else
>  static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
>  {
>  	return 0;
>  }
> +
> +static inline ssize_t xip_do_io(int rw, struct kiocb *iocb, struct inode *inode,
> +		const struct iovec *iov, loff_t offset, unsigned nr_segs,
> +		get_block_t get_block, dio_iodone_t end_io, int flags)
> +{
> +	return -ENOTTY;
> +}
>  #endif
>  
>  #ifdef CONFIG_BLOCK
> @@ -2669,6 +2674,11 @@ extern int generic_show_options(struct seq_file *m, struct dentry *root);
>  extern void save_mount_options(struct super_block *sb, char *options);
>  extern void replace_mount_options(struct super_block *sb, char *options);
>  
> +static inline bool io_is_direct(struct file *filp)
> +{
> +	return (filp->f_flags & O_DIRECT) || IS_XIP(file_inode(filp));
> +}
> +
>  static inline ino_t parent_ino(struct dentry *dentry)
>  {
>  	ino_t res;
> diff --git a/mm/filemap.c b/mm/filemap.c
> index b7749a9..61a31f0 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1417,8 +1417,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
>  	if (retval)
>  		return retval;
>  
> -	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
> -	if (filp->f_flags & O_DIRECT) {
> +	if (io_is_direct(filp)) {
>  		loff_t size;
>  		struct address_space *mapping;
>  		struct inode *inode;
> @@ -2470,8 +2469,7 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
>  	if (err)
>  		goto out;
>  
> -	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
> -	if (unlikely(file->f_flags & O_DIRECT)) {
> +	if (io_is_direct(file)) {
>  		loff_t endbyte;
>  		ssize_t written_buffered;
>  
> diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
> index c8d23e9..f7c37a1 100644
> --- a/mm/filemap_xip.c
> +++ b/mm/filemap_xip.c
> @@ -42,119 +42,6 @@ static struct page *xip_sparse_page(void)
>  }
>  
>  /*
> - * This is a file read routine for execute in place files, and uses
> - * the mapping->a_ops->get_xip_mem() function for the actual low-level
> - * stuff.
> - *
> - * Note the struct file* is not used at all.  It may be NULL.
> - */
> -static ssize_t
> -do_xip_mapping_read(struct address_space *mapping,
> -		    struct file_ra_state *_ra,
> -		    struct file *filp,
> -		    char __user *buf,
> -		    size_t len,
> -		    loff_t *ppos)
> -{
> -	struct inode *inode = mapping->host;
> -	pgoff_t index, end_index;
> -	unsigned long offset;
> -	loff_t isize, pos;
> -	size_t copied = 0, error = 0;
> -
> -	BUG_ON(!mapping->a_ops->get_xip_mem);
> -
> -	pos = *ppos;
> -	index = pos >> PAGE_CACHE_SHIFT;
> -	offset = pos & ~PAGE_CACHE_MASK;
> -
> -	isize = i_size_read(inode);
> -	if (!isize)
> -		goto out;
> -
> -	end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
> -	do {
> -		unsigned long nr, left;
> -		void *xip_mem;
> -		unsigned long xip_pfn;
> -		int zero = 0;
> -
> -		/* nr is the maximum number of bytes to copy from this page */
> -		nr = PAGE_CACHE_SIZE;
> -		if (index >= end_index) {
> -			if (index > end_index)
> -				goto out;
> -			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
> -			if (nr <= offset) {
> -				goto out;
> -			}
> -		}
> -		nr = nr - offset;
> -		if (nr > len - copied)
> -			nr = len - copied;
> -
> -		error = mapping->a_ops->get_xip_mem(mapping, index, 0,
> -							&xip_mem, &xip_pfn);
> -		if (unlikely(error)) {
> -			if (error == -ENODATA) {
> -				/* sparse */
> -				zero = 1;
> -			} else
> -				goto out;
> -		}
> -
> -		/* If users can be writing to this page using arbitrary
> -		 * virtual addresses, take care about potential aliasing
> -		 * before reading the page on the kernel side.
> -		 */
> -		if (mapping_writably_mapped(mapping))
> -			/* address based flush */ ;
> -
> -		/*
> -		 * Ok, we have the mem, so now we can copy it to user space...
> -		 *
> -		 * The actor routine returns how many bytes were actually used..
> -		 * NOTE! This may not be the same as how much of a user buffer
> -		 * we filled up (we may be padding etc), so we can only update
> -		 * "pos" here (the actor routine has to update the user buffer
> -		 * pointers and the remaining count).
> -		 */
> -		if (!zero)
> -			left = __copy_to_user(buf+copied, xip_mem+offset, nr);
> -		else
> -			left = __clear_user(buf + copied, nr);
> -
> -		if (left) {
> -			error = -EFAULT;
> -			goto out;
> -		}
> -
> -		copied += (nr - left);
> -		offset += (nr - left);
> -		index += offset >> PAGE_CACHE_SHIFT;
> -		offset &= ~PAGE_CACHE_MASK;
> -	} while (copied < len);
> -
> -out:
> -	*ppos = pos + copied;
> -	if (filp)
> -		file_accessed(filp);
> -
> -	return (copied ? copied : error);
> -}
> -
> -ssize_t
> -xip_file_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
> -{
> -	if (!access_ok(VERIFY_WRITE, buf, len))
> -		return -EFAULT;
> -
> -	return do_xip_mapping_read(filp->f_mapping, &filp->f_ra, filp,
> -			    buf, len, ppos);
> -}
> -EXPORT_SYMBOL_GPL(xip_file_read);
> -
> -/*
>   * __xip_unmap is invoked from xip_unmap and
>   * xip_write
>   *
> @@ -340,127 +227,6 @@ int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
>  }
>  EXPORT_SYMBOL_GPL(xip_file_mmap);
>  
> -static ssize_t
> -__xip_file_write(struct file *filp, const char __user *buf,
> -		  size_t count, loff_t pos, loff_t *ppos)
> -{
> -	struct address_space * mapping = filp->f_mapping;
> -	const struct address_space_operations *a_ops = mapping->a_ops;
> -	struct inode 	*inode = mapping->host;
> -	long		status = 0;
> -	size_t		bytes;
> -	ssize_t		written = 0;
> -
> -	BUG_ON(!mapping->a_ops->get_xip_mem);
> -
> -	do {
> -		unsigned long index;
> -		unsigned long offset;
> -		size_t copied;
> -		void *xip_mem;
> -		unsigned long xip_pfn;
> -
> -		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
> -		index = pos >> PAGE_CACHE_SHIFT;
> -		bytes = PAGE_CACHE_SIZE - offset;
> -		if (bytes > count)
> -			bytes = count;
> -
> -		status = a_ops->get_xip_mem(mapping, index, 0,
> -						&xip_mem, &xip_pfn);
> -		if (status == -ENODATA) {
> -			/* we allocate a new page unmap it */
> -			mutex_lock(&xip_sparse_mutex);
> -			status = a_ops->get_xip_mem(mapping, index, 1,
> -							&xip_mem, &xip_pfn);
> -			mutex_unlock(&xip_sparse_mutex);
> -			if (!status)
> -				/* unmap page at pgoff from all other vmas */
> -				__xip_unmap(mapping, index);
> -		}
> -
> -		if (status)
> -			break;
> -
> -		copied = bytes -
> -			__copy_from_user_nocache(xip_mem + offset, buf, bytes);
> -
> -		if (likely(copied > 0)) {
> -			status = copied;
> -
> -			if (status >= 0) {
> -				written += status;
> -				count -= status;
> -				pos += status;
> -				buf += status;
> -			}
> -		}
> -		if (unlikely(copied != bytes))
> -			if (status >= 0)
> -				status = -EFAULT;
> -		if (status < 0)
> -			break;
> -	} while (count);
> -	*ppos = pos;
> -	/*
> -	 * No need to use i_size_read() here, the i_size
> -	 * cannot change under us because we hold i_mutex.
> -	 */
> -	if (pos > inode->i_size) {
> -		i_size_write(inode, pos);
> -		mark_inode_dirty(inode);
> -	}
> -
> -	return written ? written : status;
> -}
> -
> -ssize_t
> -xip_file_write(struct file *filp, const char __user *buf, size_t len,
> -	       loff_t *ppos)
> -{
> -	struct address_space *mapping = filp->f_mapping;
> -	struct inode *inode = mapping->host;
> -	size_t count;
> -	loff_t pos;
> -	ssize_t ret;
> -
> -	mutex_lock(&inode->i_mutex);
> -
> -	if (!access_ok(VERIFY_READ, buf, len)) {
> -		ret=-EFAULT;
> -		goto out_up;
> -	}
> -
> -	pos = *ppos;
> -	count = len;
> -
> -	/* We can write back this queue in page reclaim */
> -	current->backing_dev_info = mapping->backing_dev_info;
> -
> -	ret = generic_write_checks(filp, &pos, &count, S_ISBLK(inode->i_mode));
> -	if (ret)
> -		goto out_backing;
> -	if (count == 0)
> -		goto out_backing;
> -
> -	ret = file_remove_suid(filp);
> -	if (ret)
> -		goto out_backing;
> -
> -	ret = file_update_time(filp);
> -	if (ret)
> -		goto out_backing;
> -
> -	ret = __xip_file_write (filp, buf, count, pos, ppos);
> -
> - out_backing:
> -	current->backing_dev_info = NULL;
> - out_up:
> -	mutex_unlock(&inode->i_mutex);
> -	return ret;
> -}
> -EXPORT_SYMBOL_GPL(xip_file_write);
> -
>  /*
>   * truncate a page used for execute in place
>   * functionality is analog to block_truncate_page but does use get_xip_mem
> -- 
> 1.8.5.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 19/22] ext4: Add XIP functionality
       [not found] ` <CF1FF3EB.24114%matthew.r.wilcox@intel.com>
@ 2014-02-11 23:12   ` Ross Zwisler
  2014-02-13  0:00     ` Ross Zwisler
  0 siblings, 1 reply; 45+ messages in thread
From: Ross Zwisler @ 2014-02-11 23:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, linux-fsdevel, linux-mm, linux-ext4, Ross Zwisler

On Wed, 15 Jan 2014, Matthew Wilcox wrote:
> From: Ross Zwisler <ross.zwisler@linux.intel.com>
> 
> This is a port of the XIP functionality found in the current version of
> ext2.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
> [heavily tweaked]
> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

...

> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c767666..8b73d77 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -663,6 +663,18 @@ found:
>  			WARN_ON(1);
>  		}
>  
> +		/* this is probably wrong for ext4.  unlike ext2, ext4 supports
> +		 * uninitialised extents, so we should probably be hooking
> +		 * into the "make it initialised" code instead. */
> +		if (IS_XIP(inode)) {

With the very first version of this patch the above logic seemed to work
correctly, zeroing blocks as we allocated them.  With the current XIP
infrastructure based tightly on direct IO this ends up being wrong because in
some cases we can call ext4_map_blocks() twice for a given block.  

A quick userland test program that creates a new file, truncates it up to 4k
and then does a partial block write will end up giving you a file filled with
all zeros.  This is because we zero the data before the write, do the write,
and then zero again, overwriting the data.  The second call to
ext4_map_blocks() happens via ext4_ext_direct_IO =>
ext4_convert_unwritten_extents() => ext4_map_blocks().

We can know in ext4_map_blocks() that we are being called after a write has
already completed by looking at the flags.  One solution to get around this
double-zeroing would be to change the above test to:

+                 if (IS_XIP(inode) && !(flags & EXT4_GET_BLOCKS_CONVERT)) {

This fixes the tests I've been able to come up with, but I'm not certain it's
the correct fix for the long term.  It seems wasteful to zero the blocks we're
allocating, just to have the zeros overwritten immediately by a write.  Maybe
a cleaner way would be to try and zero the unwritten bits inside of
ext4_convert_unwritten_extents(), or somewhere similar?

It's worth noting that I don't think the direct I/O path has this kind of
logic because they don't allow partial block writes.  The regular I/O path
knows to zero unwritten space based on the BH_New flag, as set via the
set_buffer_new() call in ext4_da_map_blocks().  This is a pretty different I/O
path, though, so I'm not sure how much we can borrow for the XIP code.

Thoughts on the correct fix?

- Ross

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 06/22] Treat XIP like O_DIRECT
       [not found] ` <CF215477.24281%matthew.r.wilcox@intel.com>
@ 2014-02-12 23:53   ` Ross Zwisler
  0 siblings, 0 replies; 45+ messages in thread
From: Ross Zwisler @ 2014-02-12 23:53 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-kernel, linux-fsdevel, linux-mm, linux-ext4

On Wed, 15 Jan 2014, Matthew Wilcox wrote:
> Instead of separate read and write methods, use the generic AIO
> infrastructure.  In addition to giving us support for AIO, this adds
> the locking between read() and truncate() that was missing.
> 
> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

...

> +static ssize_t xip_io(int rw, struct inode *inode, const struct iovec
> *iov,
> +			loff_t start, loff_t end, unsigned nr_segs,
> +			get_block_t get_block, struct buffer_head *bh)
> +{
> +	ssize_t retval = 0;
> +	unsigned seg = 0;
> +	unsigned len;
> +	unsigned copied = 0;
> +	loff_t offset = start;
> +	loff_t max = start;
> +	void *addr;
> +	bool hole = false;
> +
> +	while (offset < end) {
> +		void __user *buf = iov[seg].iov_base + copied;
> +
> +		if (max == offset) {
> +			sector_t block = offset >> inode->i_blkbits;
> +			long size;
> +			memset(bh, 0, sizeof(*bh));
> +			bh->b_size = ALIGN(end - offset, PAGE_SIZE);
> +			retval = get_block(inode, block, bh, rw == WRITE);
> +			if (retval)
> +				break;
> +			if (buffer_mapped(bh)) {
> +				retval = xip_get_addr(inode, bh, &addr);
> +				if (retval < 0)
> +					break;
> +				addr += offset - (block << inode->i_blkbits);
> +				hole = false;
> +				size = retval;
> +			} else {
> +				if (rw == WRITE) {
> +					retval = -EIO;
> +					break;
> +				}
> +				addr = NULL;
> +				hole = true;
> +				size = bh->b_size;
> +			}
> +			max = offset + size;
> +		}
> +
> +		len = min_t(unsigned, iov[seg].iov_len - copied, max - offset);
> +
> +		if (rw == WRITE)
> +			len -= __copy_from_user_nocache(addr, buf, len);
> +		else if (!hole)
> +			len -= __copy_to_user(buf, addr, len);
> +		else
> +			len -= __clear_user(buf, len);
> +
> +		if (!len)
> +			break;
> +
> +		offset += len;
> +		copied += len;
> +		if (copied == iov[seg].iov_len) {
> +			seg++;
> +			copied = 0;
> +		}
> +	}
> +
> +	return (offset == start) ? retval : offset - start;
> +}

xip_io() as it is currently written has an issue where reads can go beyond
inode->i_size.  A quick test to show this issue is:

	create a new file
	write to the file for 1/2 a block
	seek back to 0
	read for a full block

The read in this case will return 4096, the length of the full block that was
requested.  It should return 2048, reading just the data that was written.

The issue is that we do have a full block allocated in ext4, we do have it
available via XIP via xip_get_addr(), and the only extra check that we
currently have is a check against iov_len.  iov_len in this case is 4096, so
no one stops us from doing a full block read.

Here is a quick patch that fixes this issue:

diff --git a/fs/xip.c b/fs/xip.c
index e902593..1608f29 100644
--- a/fs/xip.c
+++ b/fs/xip.c
@@ -91,13 +91,16 @@ static ssize_t xip_io(int rw, struct inode *inode, const struct
 {
        ssize_t retval = 0;
        unsigned seg = 0;
-       unsigned len;
+       unsigned len, total_len;
        unsigned copied = 0;
        loff_t offset = start;
        loff_t max = start;
        void *addr;
        bool hole = false;
 
+       end = min(end, inode->i_size);
+       total_len = end - start;
+
        while (offset < end) {
                void __user *buf = iov[seg].iov_base + copied;
 
@@ -136,6 +139,7 @@ static ssize_t xip_io(int rw, struct inode *inode, const struct
                }
 
                len = min_t(unsigned, iov[seg].iov_len - copied, max - offset);
+               len = min(len, total_len);
 
                if (rw == WRITE)
                        len -= __copy_from_user_nocache(addr, buf, len);
@@ -149,6 +153,7 @@ static ssize_t xip_io(int rw, struct inode *inode, const struct
 
                offset += len;
                copied += len;
+               total_len -= len;
                if (copied == iov[seg].iov_len) {
                        seg++;
                        copied = 0;

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 19/22] ext4: Add XIP functionality
  2014-02-11 23:12   ` [PATCH v5 19/22] ext4: Add XIP functionality Ross Zwisler
@ 2014-02-13  0:00     ` Ross Zwisler
  0 siblings, 0 replies; 45+ messages in thread
From: Ross Zwisler @ 2014-02-13  0:00 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Matthew Wilcox, linux-kernel, linux-fsdevel, linux-mm, linux-ext4, david

On Tue, 11 Feb 2014, Ross Zwisler wrote:
> On Wed, 15 Jan 2014, Matthew Wilcox wrote:
> > From: Ross Zwisler <ross.zwisler@linux.intel.com>
> > 
> > This is a port of the XIP functionality found in the current version of
> > ext2.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
> > [heavily tweaked]
> > Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
> 
> ...
> 
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index c767666..8b73d77 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -663,6 +663,18 @@ found:
> >  			WARN_ON(1);
> >  		}
> >  
> > +		/* this is probably wrong for ext4.  unlike ext2, ext4 supports
> > +		 * uninitialised extents, so we should probably be hooking
> > +		 * into the "make it initialised" code instead. */
> > +		if (IS_XIP(inode)) {
> 
> With the very first version of this patch the above logic seemed to work
> correctly, zeroing blocks as we allocated them.  With the current XIP
> infrastructure based tightly on direct IO this ends up being wrong because in
> some cases we can call ext4_map_blocks() twice for a given block.  
> 
> A quick userland test program that creates a new file, truncates it up to 4k
> and then does a partial block write will end up giving you a file filled with
> all zeros.  This is because we zero the data before the write, do the write,
> and then zero again, overwriting the data.  The second call to
> ext4_map_blocks() happens via ext4_ext_direct_IO =>
> ext4_convert_unwritten_extents() => ext4_map_blocks().
> 
> We can know in ext4_map_blocks() that we are being called after a write has
> already completed by looking at the flags.  One solution to get around this
> double-zeroing would be to change the above test to:
> 
> +                 if (IS_XIP(inode) && !(flags & EXT4_GET_BLOCKS_CONVERT)) {
> 
> This fixes the tests I've been able to come up with, but I'm not certain it's
> the correct fix for the long term.  It seems wasteful to zero the blocks we're
> allocating, just to have the zeros overwritten immediately by a write.  Maybe
> a cleaner way would be to try and zero the unwritten bits inside of
> ext4_convert_unwritten_extents(), or somewhere similar?
> 
> It's worth noting that I don't think the direct I/O path has this kind of
> logic because they don't allow partial block writes.  The regular I/O path
> knows to zero unwritten space based on the BH_New flag, as set via the
> set_buffer_new() call in ext4_da_map_blocks().  This is a pretty different I/O
> path, though, so I'm not sure how much we can borrow for the XIP code.
> 
> Thoughts on the correct fix?
> 
> - Ross

It looks like Dave Chinner outlined a way to deal with this in response to the
"[PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4" mail.

I'll try and implement things as Dave has described (zero full blocks in the
case of xip_fault() and mark extents as written, use buffer_new(bh) to zero
edges for normal I/O) and send out code or questions as I have them.

- Ross

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2014-02-13  0:00 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-16  1:24 [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Matthew Wilcox
2014-01-16  1:24 ` [PATCH v5 01/22] Fix XIP fault vs truncate race Matthew Wilcox
2014-01-16  1:24 ` [PATCH v5 02/22] Allow page fault handlers to perform the COW Matthew Wilcox
2014-01-16  1:24 ` [PATCH v5 03/22] axonram: Fix bug in direct_access Matthew Wilcox
2014-01-16  1:24 ` [PATCH v5 04/22] Change direct_access calling convention Matthew Wilcox
2014-01-16  1:24 ` [PATCH v5 05/22] Introduce IS_XIP(inode) Matthew Wilcox
2014-01-16  1:24 ` [PATCH v5 06/22] Treat XIP like O_DIRECT Matthew Wilcox
2014-01-31 16:59   ` Jan Kara
2014-01-16  1:24 ` [PATCH v5 07/22] Rewrite XIP page fault handling Matthew Wilcox
2014-01-16  1:24 ` [PATCH v5 08/22] Change xip_truncate_page to take a get_block parameter Matthew Wilcox
2014-01-16  1:24 ` [PATCH v5 09/22] Remove mm/filemap_xip.c Matthew Wilcox
2014-01-16  1:24 ` [PATCH v5 10/22] Remove get_xip_mem Matthew Wilcox
2014-01-16  1:46   ` Randy Dunlap
2014-01-27 13:26     ` Matthew Wilcox
2014-01-16  1:24 ` [PATCH v5 11/22] Replace ext2_clear_xip_target with xip_clear_blocks Matthew Wilcox
2014-01-16  1:24 ` [PATCH v5 12/22] ext2: Remove ext2_xip_verify_sb() Matthew Wilcox
2014-01-16  1:24 ` [PATCH v5 13/22] ext2: Remove ext2_use_xip Matthew Wilcox
2014-01-16  1:24 ` [PATCH v5 14/22] ext2: Remove xip.c and xip.h Matthew Wilcox
2014-01-16  1:24 ` [PATCH v5 15/22] Remove CONFIG_EXT2_FS_XIP Matthew Wilcox
2014-01-16  1:24 ` [PATCH v5 16/22] ext2: Remove ext2_aops_xip Matthew Wilcox
2014-01-16  1:24 ` [PATCH v5 17/22] xip: Add xip_zero_page_range Matthew Wilcox
2014-01-16  1:24 ` [PATCH v5 18/22] ext4: Make ext4_block_zero_page_range static Matthew Wilcox
2014-01-16  1:24 ` [PATCH v5 19/22] ext4: Add XIP functionality Matthew Wilcox
2014-01-16  1:24 ` [PATCH v5 20/22] ext4: Fix typos Matthew Wilcox
2014-01-16  1:24 ` [PATCH v5 21/22] xip: Add reporting of major faults Matthew Wilcox
2014-01-16  1:24 ` [PATCH v5 22/22] XIP: Add support for unwritten extents Matthew Wilcox
     [not found] ` <CEFDA737.22F87%matthew.r.wilcox@intel.com>
2014-01-17  0:00   ` [PATCH v5 19/22] ext4: Add XIP functionality Ross Zwisler
     [not found] ` <CEFD7DAD.22F65%matthew.r.wilcox@intel.com>
2014-01-22 22:51   ` [PATCH v5 22/22] XIP: Add support for unwritten extents Ross Zwisler
2014-01-23 12:08     ` Matthew Wilcox
2014-01-23 19:13       ` Ross Zwisler
     [not found]     ` <CF0C370C.235F1%willy@linux.intel.com>
2014-01-27 23:32       ` Ross Zwisler
2014-01-28  3:49         ` Matthew Wilcox
2014-01-23  7:48 ` [PATCH v5 00/22] Rewrite XIP code and add XIP support to ext4 Dave Chinner
2014-01-23  7:53   ` Dave Chinner
2014-01-23  9:01 ` Dave Chinner
2014-01-23 12:12   ` Wilcox, Matthew R
2014-01-28  6:06     ` Dave Chinner
2014-01-30  6:42 ` Dave Chinner
2014-01-30  9:25   ` Dave Chinner
2014-01-31  3:06     ` Dave Chinner
2014-01-31  5:45       ` Ross Zwisler
2014-01-31 13:04         ` Dave Chinner
     [not found] ` <CF1FF3EB.24114%matthew.r.wilcox@intel.com>
2014-02-11 23:12   ` [PATCH v5 19/22] ext4: Add XIP functionality Ross Zwisler
2014-02-13  0:00     ` Ross Zwisler
     [not found] ` <CF215477.24281%matthew.r.wilcox@intel.com>
2014-02-12 23:53   ` [PATCH v5 06/22] Treat XIP like O_DIRECT Ross Zwisler

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).