All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch 0/5] new truncate sequence patchset
@ 2009-07-10  7:30 npiggin
  2009-07-10  7:30 ` [patch 1/5] fs: new truncate helpers npiggin
                   ` (6 more replies)
  0 siblings, 7 replies; 28+ messages in thread
From: npiggin @ 2009-07-10  7:30 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: hch, viro, jack

Hi,

I decided to keep the ATTR_SIZE masking code in there. I think
inode_setattr should basically be a simple_setattr, and attribute
masking is not too unusual. If there is large demand, then we
could impement a notruncate_setattr function to avoid masking, but
for now I think this is OK.

Please consider this for merging.

Thanks,
Nick



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 1/5] fs: new truncate helpers
  2009-07-10  7:30 [patch 0/5] new truncate sequence patchset npiggin
@ 2009-07-10  7:30 ` npiggin
  2009-07-10  7:30   ` npiggin
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 28+ messages in thread
From: npiggin @ 2009-07-10  7:30 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: hch, viro, jack

[-- Attachment #1: fs-new-truncate-helpers.patch --]
[-- Type: text/plain, Size: 11946 bytes --]

Introduce new truncate helpers truncate_pagecache and inode_newsize_ok.
vmtruncate is also consolidated from mm/memory.c and mm/nommu.c and
into mm/truncate.c.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 Documentation/vm/locking |    2 -
 fs/attr.c                |   44 ++++++++++++++++++++++++++++++
 include/linux/fs.h       |    1 
 include/linux/mm.h       |    1 
 mm/filemap.c             |    2 -
 mm/memory.c              |   62 ++-----------------------------------------
 mm/mremap.c              |    4 +-
 mm/nommu.c               |   40 ----------------------------
 mm/truncate.c            |   67 +++++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 119 insertions(+), 104 deletions(-)

Index: linux-2.6/fs/attr.c
===================================================================
--- linux-2.6.orig/fs/attr.c
+++ linux-2.6/fs/attr.c
@@ -60,9 +60,51 @@ fine:
 error:
 	return retval;
 }
-
 EXPORT_SYMBOL(inode_change_ok);
 
+/**
+ * inode_newsize_ok - may this inode be truncated to a given size
+ * @inode:	the inode to be truncated
+ * @offset:	the new size to assign to the inode
+ * @Returns:	0 on success, -ve errno on failure
+ *
+ * inode_newsize_ok will check filesystem limits and ulimits to check that the
+ * new inode size is within limits. inode_newsize_ok will also send SIGXFSZ
+ * when necessary. Caller must not proceed with inode size change if failure is
+ * returned. @inode must be a file (not directory), with appropriate
+ * permissions to allow truncate (inode_newsize_ok does NOT check these
+ * conditions).
+ *
+ * inode_newsize_ok must be called with i_mutex held.
+ */
+int inode_newsize_ok(struct inode *inode, loff_t offset)
+{
+	if (inode->i_size < offset) {
+		unsigned long limit;
+
+		limit = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
+		if (limit != RLIM_INFINITY && offset > limit)
+			goto out_sig;
+		if (offset > inode->i_sb->s_maxbytes)
+			goto out_big;
+	} else {
+		/*
+		 * truncation of in-use swapfiles is disallowed - it would
+		 * cause subsequent swapout to scribble on the now-freed
+		 * blocks.
+		 */
+		if (IS_SWAPFILE(inode))
+			return -ETXTBSY;
+	}
+
+	return 0;
+out_sig:
+	send_sig(SIGXFSZ, current, 0);
+out_big:
+	return -EFBIG;
+}
+EXPORT_SYMBOL(inode_newsize_ok);
+
 int inode_setattr(struct inode * inode, struct iattr * attr)
 {
 	unsigned int ia_valid = attr->ia_valid;
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -2367,6 +2367,7 @@ extern int buffer_migrate_page(struct ad
 #endif
 
 extern int inode_change_ok(struct inode *, struct iattr *);
+extern int inode_newsize_ok(struct inode *, loff_t offset);
 extern int __must_check inode_setattr(struct inode *, struct iattr *);
 
 extern void file_update_time(struct file *file);
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -805,6 +805,7 @@ static inline void unmap_shared_mapping_
 	unmap_mapping_range(mapping, holebegin, holelen, 0);
 }
 
+extern void truncate_pagecache(struct inode * inode, loff_t old, loff_t new);
 extern int vmtruncate(struct inode * inode, loff_t offset);
 extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end);
 
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -282,7 +282,8 @@ void free_pgtables(struct mmu_gather *tl
 		unsigned long addr = vma->vm_start;
 
 		/*
-		 * Hide vma from rmap and vmtruncate before freeing pgtables
+		 * Hide vma from rmap and truncate_pagecache before freeing
+		 * pgtables
 		 */
 		anon_vma_unlink(vma);
 		unlink_file_vma(vma);
@@ -2358,7 +2359,7 @@ restart:
  * @mapping: the address space containing mmaps to be unmapped.
  * @holebegin: byte in first page to unmap, relative to the start of
  * the underlying file.  This will be rounded down to a PAGE_SIZE
- * boundary.  Note that this is different from vmtruncate(), which
+ * boundary.  Note that this is different from truncate_pagecache(), which
  * must keep the partial page.  In contrast, we must get rid of
  * partial pages.
  * @holelen: size of prospective hole in bytes.  This will be rounded
@@ -2409,63 +2410,6 @@ void unmap_mapping_range(struct address_
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
-/**
- * vmtruncate - unmap mappings "freed" by truncate() syscall
- * @inode: inode of the file used
- * @offset: file offset to start truncating
- *
- * NOTE! We have to be ready to update the memory sharing
- * between the file and the memory map for a potential last
- * incomplete page.  Ugly, but necessary.
- */
-int vmtruncate(struct inode * inode, loff_t offset)
-{
-	if (inode->i_size < offset) {
-		unsigned long limit;
-
-		limit = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
-		if (limit != RLIM_INFINITY && offset > limit)
-			goto out_sig;
-		if (offset > inode->i_sb->s_maxbytes)
-			goto out_big;
-		i_size_write(inode, offset);
-	} else {
-		struct address_space *mapping = inode->i_mapping;
-
-		/*
-		 * truncation of in-use swapfiles is disallowed - it would
-		 * cause subsequent swapout to scribble on the now-freed
-		 * blocks.
-		 */
-		if (IS_SWAPFILE(inode))
-			return -ETXTBSY;
-		i_size_write(inode, offset);
-
-		/*
-		 * unmap_mapping_range is called twice, first simply for
-		 * efficiency so that truncate_inode_pages does fewer
-		 * single-page unmaps.  However after this first call, and
-		 * before truncate_inode_pages finishes, it is possible for
-		 * private pages to be COWed, which remain after
-		 * truncate_inode_pages finishes, hence the second
-		 * unmap_mapping_range call must be made for correctness.
-		 */
-		unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
-		truncate_inode_pages(mapping, offset);
-		unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
-	}
-
-	if (inode->i_op->truncate)
-		inode->i_op->truncate(inode);
-	return 0;
-
-out_sig:
-	send_sig(SIGXFSZ, current, 0);
-out_big:
-	return -EFBIG;
-}
-EXPORT_SYMBOL(vmtruncate);
-
 int vmtruncate_range(struct inode *inode, loff_t offset, loff_t end)
 {
 	struct address_space *mapping = inode->i_mapping;
Index: linux-2.6/mm/nommu.c
===================================================================
--- linux-2.6.orig/mm/nommu.c
+++ linux-2.6/mm/nommu.c
@@ -86,46 +86,6 @@ struct vm_operations_struct generic_file
 };
 
 /*
- * Handle all mappings that got truncated by a "truncate()"
- * system call.
- *
- * NOTE! We have to be ready to update the memory sharing
- * between the file and the memory map for a potential last
- * incomplete page.  Ugly, but necessary.
- */
-int vmtruncate(struct inode *inode, loff_t offset)
-{
-	struct address_space *mapping = inode->i_mapping;
-	unsigned long limit;
-
-	if (inode->i_size < offset)
-		goto do_expand;
-	i_size_write(inode, offset);
-
-	truncate_inode_pages(mapping, offset);
-	goto out_truncate;
-
-do_expand:
-	limit = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
-	if (limit != RLIM_INFINITY && offset > limit)
-		goto out_sig;
-	if (offset > inode->i_sb->s_maxbytes)
-		goto out;
-	i_size_write(inode, offset);
-
-out_truncate:
-	if (inode->i_op->truncate)
-		inode->i_op->truncate(inode);
-	return 0;
-out_sig:
-	send_sig(SIGXFSZ, current, 0);
-out:
-	return -EFBIG;
-}
-
-EXPORT_SYMBOL(vmtruncate);
-
-/*
  * Return the total memory allocated for this pointer, not
  * just what the caller asked for.
  *
Index: linux-2.6/Documentation/vm/locking
===================================================================
--- linux-2.6.orig/Documentation/vm/locking
+++ linux-2.6/Documentation/vm/locking
@@ -80,7 +80,7 @@ Note: PTL can also be used to guarantee
 mm start up ... this is a loose form of stability on mm_users. For
 example, it is used in copy_mm to protect against a racing tlb_gather_mmu
 single address space optimization, so that the zap_page_range (from
-vmtruncate) does not lose sending ipi's to cloned threads that might 
+truncate) does not lose sending ipi's to cloned threads that might
 be spawned underneath it and go to user mode to drag in pte's into tlbs.
 
 swap_lock
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -59,7 +59,7 @@
 /*
  * Lock ordering:
  *
- *  ->i_mmap_lock		(vmtruncate)
+ *  ->i_mmap_lock		(truncate_pagecache)
  *    ->private_lock		(__free_pte->__set_page_dirty_buffers)
  *      ->swap_lock		(exclusive_swap_page, others)
  *        ->mapping->tree_lock
Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c
+++ linux-2.6/mm/mremap.c
@@ -85,8 +85,8 @@ static void move_ptes(struct vm_area_str
 	if (vma->vm_file) {
 		/*
 		 * Subtle point from Rajesh Venkatasubramanian: before
-		 * moving file-based ptes, we must lock vmtruncate out,
-		 * since it might clean the dst vma before the src vma,
+		 * moving file-based ptes, we must lock truncate_pagecache
+		 * out, since it might clean the dst vma before the src vma,
 		 * and we propagate stale pages into the dst afterward.
 		 */
 		mapping = vma->vm_file->f_mapping;
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -465,3 +465,70 @@ int invalidate_inode_pages2(struct addre
 	return invalidate_inode_pages2_range(mapping, 0, -1);
 }
 EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
+
+/**
+ * truncate_pagecache - unmap mappings "freed" by truncate() syscall
+ * @inode: inode
+ * @old: old file offset
+ * @new: new file offset
+ *
+ * inode's new i_size must already be written before truncate_pagecache
+ * is called.
+ *
+ * This function should typically be called before the filesystem
+ * releases resources associated with the freed range (eg. deallocates
+ * blocks). This way, pagecache will always stay logically coherent
+ * with on-disk format, and the filesystem would not have to deal with
+ * situations such as writepage being called for a page that has already
+ * had its underlying blocks deallocated.
+ */
+void truncate_pagecache(struct inode * inode, loff_t old, loff_t new)
+{
+	if (new < old) {
+		struct address_space *mapping = inode->i_mapping;
+
+		/*
+		 * unmap_mapping_range is called twice, first simply for
+		 * efficiency so that truncate_inode_pages does fewer
+		 * single-page unmaps.  However after this first call, and
+		 * before truncate_inode_pages finishes, it is possible for
+		 * private pages to be COWed, which remain after
+		 * truncate_inode_pages finishes, hence the second
+		 * unmap_mapping_range call must be made for correctness.
+		 */
+		unmap_mapping_range(mapping, new + PAGE_SIZE - 1, 0, 1);
+		truncate_inode_pages(mapping, new);
+		unmap_mapping_range(mapping, new + PAGE_SIZE - 1, 0, 1);
+	}
+}
+EXPORT_SYMBOL(truncate_pagecache);
+
+/**
+ * vmtruncate - unmap mappings "freed" by truncate() syscall
+ * @inode: inode of the file used
+ * @offset: file offset to start truncating
+ *
+ * NOTE! We have to be ready to update the memory sharing
+ * between the file and the memory map for a potential last
+ * incomplete page.  Ugly, but necessary.
+ *
+ * This function is deprecated and simple_setsize or truncate_pagecache
+ * should be used instead.
+ */
+int vmtruncate(struct inode * inode, loff_t offset)
+{
+	loff_t oldsize;
+	int error;
+
+	error = inode_newsize_ok(inode, offset);
+	if (error)
+		return error;
+	oldsize = inode->i_size;
+	i_size_write(inode, offset);
+	truncate_pagecache(inode, oldsize, offset);
+	if (inode->i_op->truncate)
+		inode->i_op->truncate(inode);
+
+	return error;
+}
+EXPORT_SYMBOL(vmtruncate);



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 2/5] fs: use new truncate helpers
  2009-07-10  7:30 [patch 0/5] new truncate sequence patchset npiggin
@ 2009-07-10  7:30   ` npiggin
  2009-07-10  7:30   ` npiggin
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 28+ messages in thread
From: npiggin-l3A5Bk7waGM @ 2009-07-10  7:30 UTC (permalink / raw)
  To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: hch-wEGCiKHe2LqWVfeAwA7xHQ,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, jack-AlSwsSmVLrQ,
	Miklos Szeredi, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA,
	linux-cifs-client-w/Ol4Ecudpl8XjKLYN78aQ,
	sfrench-eUNUBHrolfbYtjvyW6yDsg

[-- Attachment #1: fs-new-truncate-convert.patch --]
[-- Type: text/plain, Size: 9336 bytes --]

Update some fs code to make use of new helper functions introduced
in the previous patch. Should be no significant change in behaviour
(except CIFS now calls send_sig under i_lock, via inode_newsize_ok).

Reviewed-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
Acked-by: Miklos Szeredi <miklos-sUDqSbJrdHQHWmgEVkV9KA@public.gmane.org>
Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org
Cc: linux-cifs-client-w/Ol4Ecudpl8XjKLYN78aQ@public.gmane.org
Cc: sfrench-eUNUBHrolfbYtjvyW6yDsg@public.gmane.org
Signed-off-by: Nick Piggin <npiggin-l3A5Bk7waGM@public.gmane.org>
---
 fs/buffer.c           |   10 +--------
 fs/cifs/inode.c       |   51 ++++++++-----------------------------------------
 fs/fuse/dir.c         |   14 +++----------
 fs/fuse/fuse_i.h      |    2 -
 fs/fuse/inode.c       |   11 ----------
 fs/nfs/inode.c        |   52 +++++++++++---------------------------------------
 fs/ramfs/file-nommu.c |   18 ++++-------------
 7 files changed, 33 insertions(+), 125 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2225,16 +2225,10 @@ int generic_cont_expand_simple(struct in
 	struct address_space *mapping = inode->i_mapping;
 	struct page *page;
 	void *fsdata;
-	unsigned long limit;
 	int err;
 
-	err = -EFBIG;
-        limit = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
-	if (limit != RLIM_INFINITY && size > (loff_t)limit) {
-		send_sig(SIGXFSZ, current, 0);
-		goto out;
-	}
-	if (size > inode->i_sb->s_maxbytes)
+	err = inode_newsize_ok(inode, size);
+	if (err)
 		goto out;
 
 	err = pagecache_write_begin(NULL, mapping, size, 0,
Index: linux-2.6/fs/cifs/inode.c
===================================================================
--- linux-2.6.orig/fs/cifs/inode.c
+++ linux-2.6/fs/cifs/inode.c
@@ -1645,57 +1645,24 @@ static int cifs_truncate_page(struct add
 
 static int cifs_vmtruncate(struct inode *inode, loff_t offset)
 {
-	struct address_space *mapping = inode->i_mapping;
-	unsigned long limit;
+	loff_t oldsize;
+	int err;
 
 	spin_lock(&inode->i_lock);
-	if (inode->i_size < offset)
-		goto do_expand;
-	/*
-	 * truncation of in-use swapfiles is disallowed - it would cause
-	 * subsequent swapout to scribble on the now-freed blocks.
-	 */
-	if (IS_SWAPFILE(inode)) {
+	err = inode_newsize_ok(inode, offset);
+	if (err) {
 		spin_unlock(&inode->i_lock);
-		goto out_busy;
+		goto out;
 	}
-	i_size_write(inode, offset);
-	spin_unlock(&inode->i_lock);
-	/*
-	 * unmap_mapping_range is called twice, first simply for efficiency
-	 * so that truncate_inode_pages does fewer single-page unmaps. However
-	 * after this first call, and before truncate_inode_pages finishes,
-	 * it is possible for private pages to be COWed, which remain after
-	 * truncate_inode_pages finishes, hence the second unmap_mapping_range
-	 * call must be made for correctness.
-	 */
-	unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
-	truncate_inode_pages(mapping, offset);
-	unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
-	goto out_truncate;
 
-do_expand:
-	limit = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
-	if (limit != RLIM_INFINITY && offset > limit) {
-		spin_unlock(&inode->i_lock);
-		goto out_sig;
-	}
-	if (offset > inode->i_sb->s_maxbytes) {
-		spin_unlock(&inode->i_lock);
-		goto out_big;
-	}
+	oldsize = inode->i_size;
 	i_size_write(inode, offset);
 	spin_unlock(&inode->i_lock);
-out_truncate:
+	truncate_pagecache(inode, oldsize, offset);
 	if (inode->i_op->truncate)
 		inode->i_op->truncate(inode);
-	return 0;
-out_sig:
-	send_sig(SIGXFSZ, current, 0);
-out_big:
-	return -EFBIG;
-out_busy:
-	return -ETXTBSY;
+out:
+	return err;
 }
 
 static int
Index: linux-2.6/fs/fuse/dir.c
===================================================================
--- linux-2.6.orig/fs/fuse/dir.c
+++ linux-2.6/fs/fuse/dir.c
@@ -1225,14 +1225,9 @@ static int fuse_do_setattr(struct dentry
 		return 0;
 
 	if (attr->ia_valid & ATTR_SIZE) {
-		unsigned long limit;
-		if (IS_SWAPFILE(inode))
-			return -ETXTBSY;
-		limit = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
-		if (limit != RLIM_INFINITY && attr->ia_size > (loff_t) limit) {
-			send_sig(SIGXFSZ, current, 0);
-			return -EFBIG;
-		}
+		err = inode_newsize_ok(inode, attr->ia_size);
+		if (err)
+			return err;
 		is_truncate = true;
 	}
 
@@ -1299,8 +1294,7 @@ static int fuse_do_setattr(struct dentry
 	 * FUSE_NOWRITE, otherwise fuse_launder_page() would deadlock.
 	 */
 	if (S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
-		if (outarg.attr.size < oldsize)
-			fuse_truncate(inode->i_mapping, outarg.attr.size);
+		truncate_pagecache(inode, oldsize, outarg.attr.size);
 		invalidate_inode_pages2(inode->i_mapping);
 	}
 
Index: linux-2.6/fs/nfs/inode.c
===================================================================
--- linux-2.6.orig/fs/nfs/inode.c
+++ linux-2.6/fs/nfs/inode.c
@@ -427,49 +427,21 @@ nfs_setattr(struct dentry *dentry, struc
  */
 static int nfs_vmtruncate(struct inode * inode, loff_t offset)
 {
-	if (i_size_read(inode) < offset) {
-		unsigned long limit;
+	loff_t oldsize;
+	int err;
 
-		limit = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
-		if (limit != RLIM_INFINITY && offset > limit)
-			goto out_sig;
-		if (offset > inode->i_sb->s_maxbytes)
-			goto out_big;
-		spin_lock(&inode->i_lock);
-		i_size_write(inode, offset);
-		spin_unlock(&inode->i_lock);
-	} else {
-		struct address_space *mapping = inode->i_mapping;
+	err = inode_newsize_ok(inode, offset);
+	if (err)
+		goto out;
 
-		/*
-		 * truncation of in-use swapfiles is disallowed - it would
-		 * cause subsequent swapout to scribble on the now-freed
-		 * blocks.
-		 */
-		if (IS_SWAPFILE(inode))
-			return -ETXTBSY;
-		spin_lock(&inode->i_lock);
-		i_size_write(inode, offset);
-		spin_unlock(&inode->i_lock);
+	spin_lock(&inode->i_lock);
+	oldsize = inode->i_size;
+	i_size_write(inode, offset);
+	spin_unlock(&inode->i_lock);
 
-		/*
-		 * unmap_mapping_range is called twice, first simply for
-		 * efficiency so that truncate_inode_pages does fewer
-		 * single-page unmaps.  However after this first call, and
-		 * before truncate_inode_pages finishes, it is possible for
-		 * private pages to be COWed, which remain after
-		 * truncate_inode_pages finishes, hence the second
-		 * unmap_mapping_range call must be made for correctness.
-		 */
-		unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
-		truncate_inode_pages(mapping, offset);
-		unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
-	}
-	return 0;
-out_sig:
-	send_sig(SIGXFSZ, current, 0);
-out_big:
-	return -EFBIG;
+	truncate_pagecache(inode, oldsize, offset);
+out:
+	return err;
 }
 
 /**
Index: linux-2.6/fs/ramfs/file-nommu.c
===================================================================
--- linux-2.6.orig/fs/ramfs/file-nommu.c
+++ linux-2.6/fs/ramfs/file-nommu.c
@@ -68,14 +68,11 @@ int ramfs_nommu_expand_for_mapping(struc
 	/* make various checks */
 	order = get_order(newsize);
 	if (unlikely(order >= MAX_ORDER))
-		goto too_big;
+		return -EFBIG;
 
-	limit = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
-	if (limit != RLIM_INFINITY && newsize > limit)
-		goto fsize_exceeded;
-
-	if (newsize > inode->i_sb->s_maxbytes)
-		goto too_big;
+	ret = inode_newsize_ok(inode, newsize);
+	if (ret)
+		return ret;
 
 	i_size_write(inode, newsize);
 
@@ -117,12 +114,7 @@ int ramfs_nommu_expand_for_mapping(struc
 
 	return 0;
 
- fsize_exceeded:
-	send_sig(SIGXFSZ, current, 0);
- too_big:
-	return -EFBIG;
-
- add_error:
+add_error:
 	while (loop < npages)
 		__free_page(pages + loop++);
 	return ret;
Index: linux-2.6/fs/fuse/fuse_i.h
===================================================================
--- linux-2.6.orig/fs/fuse/fuse_i.h
+++ linux-2.6/fs/fuse/fuse_i.h
@@ -588,8 +588,6 @@ void fuse_change_attributes(struct inode
 void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 				   u64 attr_valid);
 
-void fuse_truncate(struct address_space *mapping, loff_t offset);
-
 /**
  * Initialize the client device
  */
Index: linux-2.6/fs/fuse/inode.c
===================================================================
--- linux-2.6.orig/fs/fuse/inode.c
+++ linux-2.6/fs/fuse/inode.c
@@ -115,14 +115,6 @@ static int fuse_remount_fs(struct super_
 	return 0;
 }
 
-void fuse_truncate(struct address_space *mapping, loff_t offset)
-{
-	/* See vmtruncate() */
-	unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
-	truncate_inode_pages(mapping, offset);
-	unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
-}
-
 void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 				   u64 attr_valid)
 {
@@ -180,8 +172,7 @@ void fuse_change_attributes(struct inode
 	spin_unlock(&fc->lock);
 
 	if (S_ISREG(inode->i_mode) && oldsize != attr->size) {
-		if (attr->size < oldsize)
-			fuse_truncate(inode->i_mapping, attr->size);
+		truncate_pagecache(inode, oldsize, attr->size);
 		invalidate_inode_pages2(inode->i_mapping);
 	}
 }


--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 2/5] fs: use new truncate helpers
@ 2009-07-10  7:30   ` npiggin
  0 siblings, 0 replies; 28+ messages in thread
From: npiggin @ 2009-07-10  7:30 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: hch, viro, jack, Miklos Szeredi, linux-nfs, Trond.Myklebust,
	linux-cifs-client, sfrench

Update some fs code to make use of new helper functions introduced
in the previous patch. Should be no significant change in behaviour
(except CIFS now calls send_sig under i_lock, via inode_newsize_ok).

Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: linux-nfs@vger.kernel.org
Cc: Trond.Myklebust@netapp.com
Cc: linux-cifs-client@lists.samba.org
Cc: sfrench@samba.org
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 fs/buffer.c           |   10 +--------
 fs/cifs/inode.c       |   51 ++++++++-----------------------------------------
 fs/fuse/dir.c         |   14 +++----------
 fs/fuse/fuse_i.h      |    2 -
 fs/fuse/inode.c       |   11 ----------
 fs/nfs/inode.c        |   52 +++++++++++---------------------------------------
 fs/ramfs/file-nommu.c |   18 ++++-------------
 7 files changed, 33 insertions(+), 125 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2225,16 +2225,10 @@ int generic_cont_expand_simple(struct in
 	struct address_space *mapping = inode->i_mapping;
 	struct page *page;
 	void *fsdata;
-	unsigned long limit;
 	int err;
 
-	err = -EFBIG;
-        limit = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
-	if (limit != RLIM_INFINITY && size > (loff_t)limit) {
-		send_sig(SIGXFSZ, current, 0);
-		goto out;
-	}
-	if (size > inode->i_sb->s_maxbytes)
+	err = inode_newsize_ok(inode, size);
+	if (err)
 		goto out;
 
 	err = pagecache_write_begin(NULL, mapping, size, 0,
Index: linux-2.6/fs/cifs/inode.c
===================================================================
--- linux-2.6.orig/fs/cifs/inode.c
+++ linux-2.6/fs/cifs/inode.c
@@ -1645,57 +1645,24 @@ static int cifs_truncate_page(struct add
 
 static int cifs_vmtruncate(struct inode *inode, loff_t offset)
 {
-	struct address_space *mapping = inode->i_mapping;
-	unsigned long limit;
+	loff_t oldsize;
+	int err;
 
 	spin_lock(&inode->i_lock);
-	if (inode->i_size < offset)
-		goto do_expand;
-	/*
-	 * truncation of in-use swapfiles is disallowed - it would cause
-	 * subsequent swapout to scribble on the now-freed blocks.
-	 */
-	if (IS_SWAPFILE(inode)) {
+	err = inode_newsize_ok(inode, offset);
+	if (err) {
 		spin_unlock(&inode->i_lock);
-		goto out_busy;
+		goto out;
 	}
-	i_size_write(inode, offset);
-	spin_unlock(&inode->i_lock);
-	/*
-	 * unmap_mapping_range is called twice, first simply for efficiency
-	 * so that truncate_inode_pages does fewer single-page unmaps. However
-	 * after this first call, and before truncate_inode_pages finishes,
-	 * it is possible for private pages to be COWed, which remain after
-	 * truncate_inode_pages finishes, hence the second unmap_mapping_range
-	 * call must be made for correctness.
-	 */
-	unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
-	truncate_inode_pages(mapping, offset);
-	unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
-	goto out_truncate;
 
-do_expand:
-	limit = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
-	if (limit != RLIM_INFINITY && offset > limit) {
-		spin_unlock(&inode->i_lock);
-		goto out_sig;
-	}
-	if (offset > inode->i_sb->s_maxbytes) {
-		spin_unlock(&inode->i_lock);
-		goto out_big;
-	}
+	oldsize = inode->i_size;
 	i_size_write(inode, offset);
 	spin_unlock(&inode->i_lock);
-out_truncate:
+	truncate_pagecache(inode, oldsize, offset);
 	if (inode->i_op->truncate)
 		inode->i_op->truncate(inode);
-	return 0;
-out_sig:
-	send_sig(SIGXFSZ, current, 0);
-out_big:
-	return -EFBIG;
-out_busy:
-	return -ETXTBSY;
+out:
+	return err;
 }
 
 static int
Index: linux-2.6/fs/fuse/dir.c
===================================================================
--- linux-2.6.orig/fs/fuse/dir.c
+++ linux-2.6/fs/fuse/dir.c
@@ -1225,14 +1225,9 @@ static int fuse_do_setattr(struct dentry
 		return 0;
 
 	if (attr->ia_valid & ATTR_SIZE) {
-		unsigned long limit;
-		if (IS_SWAPFILE(inode))
-			return -ETXTBSY;
-		limit = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
-		if (limit != RLIM_INFINITY && attr->ia_size > (loff_t) limit) {
-			send_sig(SIGXFSZ, current, 0);
-			return -EFBIG;
-		}
+		err = inode_newsize_ok(inode, attr->ia_size);
+		if (err)
+			return err;
 		is_truncate = true;
 	}
 
@@ -1299,8 +1294,7 @@ static int fuse_do_setattr(struct dentry
 	 * FUSE_NOWRITE, otherwise fuse_launder_page() would deadlock.
 	 */
 	if (S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
-		if (outarg.attr.size < oldsize)
-			fuse_truncate(inode->i_mapping, outarg.attr.size);
+		truncate_pagecache(inode, oldsize, outarg.attr.size);
 		invalidate_inode_pages2(inode->i_mapping);
 	}
 
Index: linux-2.6/fs/nfs/inode.c
===================================================================
--- linux-2.6.orig/fs/nfs/inode.c
+++ linux-2.6/fs/nfs/inode.c
@@ -427,49 +427,21 @@ nfs_setattr(struct dentry *dentry, struc
  */
 static int nfs_vmtruncate(struct inode * inode, loff_t offset)
 {
-	if (i_size_read(inode) < offset) {
-		unsigned long limit;
+	loff_t oldsize;
+	int err;
 
-		limit = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
-		if (limit != RLIM_INFINITY && offset > limit)
-			goto out_sig;
-		if (offset > inode->i_sb->s_maxbytes)
-			goto out_big;
-		spin_lock(&inode->i_lock);
-		i_size_write(inode, offset);
-		spin_unlock(&inode->i_lock);
-	} else {
-		struct address_space *mapping = inode->i_mapping;
+	err = inode_newsize_ok(inode, offset);
+	if (err)
+		goto out;
 
-		/*
-		 * truncation of in-use swapfiles is disallowed - it would
-		 * cause subsequent swapout to scribble on the now-freed
-		 * blocks.
-		 */
-		if (IS_SWAPFILE(inode))
-			return -ETXTBSY;
-		spin_lock(&inode->i_lock);
-		i_size_write(inode, offset);
-		spin_unlock(&inode->i_lock);
+	spin_lock(&inode->i_lock);
+	oldsize = inode->i_size;
+	i_size_write(inode, offset);
+	spin_unlock(&inode->i_lock);
 
-		/*
-		 * unmap_mapping_range is called twice, first simply for
-		 * efficiency so that truncate_inode_pages does fewer
-		 * single-page unmaps.  However after this first call, and
-		 * before truncate_inode_pages finishes, it is possible for
-		 * private pages to be COWed, which remain after
-		 * truncate_inode_pages finishes, hence the second
-		 * unmap_mapping_range call must be made for correctness.
-		 */
-		unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
-		truncate_inode_pages(mapping, offset);
-		unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
-	}
-	return 0;
-out_sig:
-	send_sig(SIGXFSZ, current, 0);
-out_big:
-	return -EFBIG;
+	truncate_pagecache(inode, oldsize, offset);
+out:
+	return err;
 }
 
 /**
Index: linux-2.6/fs/ramfs/file-nommu.c
===================================================================
--- linux-2.6.orig/fs/ramfs/file-nommu.c
+++ linux-2.6/fs/ramfs/file-nommu.c
@@ -68,14 +68,11 @@ int ramfs_nommu_expand_for_mapping(struc
 	/* make various checks */
 	order = get_order(newsize);
 	if (unlikely(order >= MAX_ORDER))
-		goto too_big;
+		return -EFBIG;
 
-	limit = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
-	if (limit != RLIM_INFINITY && newsize > limit)
-		goto fsize_exceeded;
-
-	if (newsize > inode->i_sb->s_maxbytes)
-		goto too_big;
+	ret = inode_newsize_ok(inode, newsize);
+	if (ret)
+		return ret;
 
 	i_size_write(inode, newsize);
 
@@ -117,12 +114,7 @@ int ramfs_nommu_expand_for_mapping(struc
 
 	return 0;
 
- fsize_exceeded:
-	send_sig(SIGXFSZ, current, 0);
- too_big:
-	return -EFBIG;
-
- add_error:
+add_error:
 	while (loop < npages)
 		__free_page(pages + loop++);
 	return ret;
Index: linux-2.6/fs/fuse/fuse_i.h
===================================================================
--- linux-2.6.orig/fs/fuse/fuse_i.h
+++ linux-2.6/fs/fuse/fuse_i.h
@@ -588,8 +588,6 @@ void fuse_change_attributes(struct inode
 void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 				   u64 attr_valid);
 
-void fuse_truncate(struct address_space *mapping, loff_t offset);
-
 /**
  * Initialize the client device
  */
Index: linux-2.6/fs/fuse/inode.c
===================================================================
--- linux-2.6.orig/fs/fuse/inode.c
+++ linux-2.6/fs/fuse/inode.c
@@ -115,14 +115,6 @@ static int fuse_remount_fs(struct super_
 	return 0;
 }
 
-void fuse_truncate(struct address_space *mapping, loff_t offset)
-{
-	/* See vmtruncate() */
-	unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
-	truncate_inode_pages(mapping, offset);
-	unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
-}
-
 void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 				   u64 attr_valid)
 {
@@ -180,8 +172,7 @@ void fuse_change_attributes(struct inode
 	spin_unlock(&fc->lock);
 
 	if (S_ISREG(inode->i_mode) && oldsize != attr->size) {
-		if (attr->size < oldsize)
-			fuse_truncate(inode->i_mapping, attr->size);
+		truncate_pagecache(inode, oldsize, attr->size);
 		invalidate_inode_pages2(inode->i_mapping);
 	}
 }



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 3/5] fs: introduce new truncate sequence
  2009-07-10  7:30 [patch 0/5] new truncate sequence patchset npiggin
  2009-07-10  7:30 ` [patch 1/5] fs: new truncate helpers npiggin
  2009-07-10  7:30   ` npiggin
@ 2009-07-10  7:30 ` npiggin
  2009-07-10  7:30 ` [patch 4/5] tmpfs: convert to use the new truncate convention npiggin
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 28+ messages in thread
From: npiggin @ 2009-07-10  7:30 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: hch, viro, jack

[-- Attachment #1: fs-new-truncate.patch --]
[-- Type: text/plain, Size: 8305 bytes --]

Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
setattr > vmtruncate > truncate, have filesystems call their truncate sequence
from ->setattr if filesystem specific operations are required. vmtruncate is
deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
previously should be used.

simple_setsize is also introduced to perform the equivalent of vmtruncate.
simple_setsize gets called by inode_setattr when ATTR_SIZE is passed. So
filesystems implementing their own truncate code in setattr then calling
through to inode_setattr should clear ATTR_SIZE.

A new attribute is introduced into inode_operations structure; .new_truncate
is a temporary hack to distinguish filesystems that implement the new
truncate system. These guys cannot trim off block past i_size via vmtruncate,
so instead they must handle it in fs code. This gives better opportunity to
catch errors etc anyway. .new_truncate and .truncate will go away once all
filesystems are converted.

Big problem with the previous calling sequence: the filesystem is not called
until i_size has already changed.  This means it is not allowed to fail the
call, and also it does not know what the previous i_size was. Also, generic
code calling vmtruncate to truncate allocated blocks in case of error had
no good way to return a meaningful error (or, for example, atomically handle
block deallocation).

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 Documentation/filesystems/vfs.txt |    7 ++++++-
 fs/attr.c                         |    7 ++++++-
 fs/buffer.c                       |   12 +++++++++---
 fs/direct-io.c                    |    7 ++++---
 fs/libfs.c                        |   17 +++++++++++++++++
 include/linux/fs.h                |    2 ++
 mm/truncate.c                     |    6 ++----
 7 files changed, 46 insertions(+), 12 deletions(-)

Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -329,6 +329,22 @@ int simple_rename(struct inode *old_dir,
 	return 0;
 }
 
+int simple_setsize(struct inode *inode, loff_t newsize)
+{
+	loff_t oldsize;
+	int error;
+
+	error = inode_newsize_ok(inode, newsize);
+	if (error)
+		return error;
+
+	oldsize = inode->i_size;
+	i_size_write(inode, newsize);
+	truncate_pagecache(inode, oldsize, newsize);
+
+	return error;
+}
+
 int simple_readpage(struct file *file, struct page *page)
 {
 	clear_highpage(page);
@@ -840,6 +856,7 @@ EXPORT_SYMBOL(generic_read_dir);
 EXPORT_SYMBOL(get_sb_pseudo);
 EXPORT_SYMBOL(simple_write_begin);
 EXPORT_SYMBOL(simple_write_end);
+EXPORT_SYMBOL(simple_setsize);
 EXPORT_SYMBOL(simple_dir_inode_operations);
 EXPORT_SYMBOL(simple_dir_operations);
 EXPORT_SYMBOL(simple_empty);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -1527,6 +1527,7 @@ struct inode_operations {
 	void * (*follow_link) (struct dentry *, struct nameidata *);
 	void (*put_link) (struct dentry *, struct nameidata *, void *);
 	void (*truncate) (struct inode *);
+	int new_truncate; /* nasty hack to transition to new truncate code */
 	int (*permission) (struct inode *, int);
 	int (*setattr) (struct dentry *, struct iattr *);
 	int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
@@ -2332,6 +2333,7 @@ extern int simple_link(struct dentry *,
 extern int simple_unlink(struct inode *, struct dentry *);
 extern int simple_rmdir(struct inode *, struct dentry *);
 extern int simple_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
+extern int simple_setsize(struct inode *inode, loff_t newsize);
 extern int simple_sync_file(struct file *, struct dentry *, int);
 extern int simple_empty(struct dentry *);
 extern int simple_readpage(struct file *file, struct page *page);
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1992,9 +1992,14 @@ int block_write_begin(struct file *file,
 			 * prepare_write() may have instantiated a few blocks
 			 * outside i_size.  Trim these off again. Don't need
 			 * i_size_read because we hold i_mutex.
+			 *
+			 * Filesystems which set i_op->new_truncate must
+			 * handle this themselves. Eventually this will go
+			 * away because everyone will be converted.
 			 */
 			if (pos + len > inode->i_size)
-				vmtruncate(inode, inode->i_size);
+				if (!inode->i_op->new_truncate)
+					vmtruncate(inode, inode->i_size);
 		}
 	}
 
@@ -2371,7 +2376,7 @@ int block_commit_write(struct page *page
  *
  * We are not allowed to take the i_mutex here so we have to play games to
  * protect against truncate races as the page could now be beyond EOF.  Because
- * vmtruncate() writes the inode size before removing pages, once we have the
+ * truncate writes the inode size before removing pages, once we have the
  * page lock we can determine safely if the page is beyond EOF. If it is not
  * beyond EOF, then the page is guaranteed safe against truncation until we
  * unlock the page.
@@ -2595,7 +2600,8 @@ out_release:
 	*pagep = NULL;
 
 	if (pos + len > inode->i_size)
-		vmtruncate(inode, inode->i_size);
+		if (!inode->i_op->new_truncate)
+			vmtruncate(inode, inode->i_size);
 
 	return ret;
 }
Index: linux-2.6/fs/direct-io.c
===================================================================
--- linux-2.6.orig/fs/direct-io.c
+++ linux-2.6/fs/direct-io.c
@@ -1210,14 +1210,15 @@ __blockdev_direct_IO(int rw, struct kioc
 	/*
 	 * In case of error extending write may have instantiated a few
 	 * blocks outside i_size. Trim these off again for DIO_LOCKING.
-	 * NOTE: DIO_NO_LOCK/DIO_OWN_LOCK callers have to handle this by
-	 * it's own meaner.
+	 * NOTE: DIO_NO_LOCK/DIO_OWN_LOCK callers have to handle this in
+	 * their own manner.
 	 */
 	if (unlikely(retval < 0 && (rw & WRITE))) {
 		loff_t isize = i_size_read(inode);
 
 		if (end > isize && dio_lock_type == DIO_LOCKING)
-			vmtruncate(inode, isize);
+			if (!inode->i_op->new_truncate)
+				vmtruncate(inode, isize);
 	}
 
 	if (rw == READ && dio_lock_type == DIO_LOCKING)
Index: linux-2.6/fs/attr.c
===================================================================
--- linux-2.6.orig/fs/attr.c
+++ linux-2.6/fs/attr.c
@@ -111,7 +111,12 @@ int inode_setattr(struct inode * inode,
 
 	if (ia_valid & ATTR_SIZE &&
 	    attr->ia_size != i_size_read(inode)) {
-		int error = vmtruncate(inode, attr->ia_size);
+		int error;
+
+		if (inode->i_op->new_truncate)
+			error = simple_setsize(inode, attr->ia_size);
+		else
+			error = vmtruncate(inode, attr->ia_size);
 		if (error)
 			return error;
 	}
Index: linux-2.6/Documentation/filesystems/vfs.txt
===================================================================
--- linux-2.6.orig/Documentation/filesystems/vfs.txt
+++ linux-2.6/Documentation/filesystems/vfs.txt
@@ -401,11 +401,16 @@ otherwise noted.
   	started might not be in the page cache at the end of the
   	walk).
 
-  truncate: called by the VFS to change the size of a file.  The
+  truncate: Deprecated. This will not be called if ->setsize is defined.
+	Called by the VFS to change the size of a file.  The
  	i_size field of the inode is set to the desired size by the
  	VFS before this method is called.  This method is called by
  	the truncate(2) system call and related functionality.
 
+	Note: ->truncate and vmtruncate are deprecated. Do not add new
+	instances/calls of these. Filesystems shoud be converted to do their
+	truncate sequence via ->setattr().
+
   permission: called by the VFS to check for access rights on a POSIX-like
   	filesystem.
 
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -520,12 +520,10 @@ int vmtruncate(struct inode * inode, lof
 	loff_t oldsize;
 	int error;
 
-	error = inode_newsize_ok(inode, offset);
+	error = simple_setsize(inode, offset);
 	if (error)
 		return error;
-	oldsize = inode->i_size;
-	i_size_write(inode, offset);
-	truncate_pagecache(inode, oldsize, offset);
+
 	if (inode->i_op->truncate)
 		inode->i_op->truncate(inode);
 



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 4/5] tmpfs: convert to use the new truncate convention
  2009-07-10  7:30 [patch 0/5] new truncate sequence patchset npiggin
                   ` (2 preceding siblings ...)
  2009-07-10  7:30 ` [patch 3/5] fs: introduce new truncate sequence npiggin
@ 2009-07-10  7:30 ` npiggin
  2009-07-10  7:30 ` [patch 5/5] ext2: " npiggin
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 28+ messages in thread
From: npiggin @ 2009-07-10  7:30 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: hch, viro, jack, hugh.dickins

[-- Attachment #1: tmpfs-new-truncate.patch --]
[-- Type: text/plain, Size: 3933 bytes --]

Cc: hugh.dickins@tiscali.co.uk
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 fs/ext2/ext2.h  |    1 
 fs/ext2/file.c  |    2 
 fs/ext2/inode.c |  137 +++++++++++++++++++++++++++++++++++++++++++++-----------
 mm/shmem.c      |   35 +++++++-------
 4 files changed, 131 insertions(+), 44 deletions(-)

Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -730,10 +730,11 @@ done2:
 	if (inode->i_mapping->nrpages && (info->flags & SHMEM_PAGEIN)) {
 		/*
 		 * Call truncate_inode_pages again: racing shmem_unuse_inode
-		 * may have swizzled a page in from swap since vmtruncate or
-		 * generic_delete_inode did it, before we lowered next_index.
-		 * Also, though shmem_getpage checks i_size before adding to
-		 * cache, no recheck after: so fix the narrow window there too.
+		 * may have swizzled a page in from swap since
+		 * truncate_pagecache or generic_delete_inode did it, before we
+		 * lowered next_index.  Also, though shmem_getpage checks
+		 * i_size before adding to cache, no recheck after: so fix the
+		 * narrow window there too.
 		 *
 		 * Recalling truncate_inode_pages_range and unmap_mapping_range
 		 * every time for punch_hole (which never got a chance to clear
@@ -763,11 +764,6 @@ done2:
 	}
 }
 
-static void shmem_truncate(struct inode *inode)
-{
-	shmem_truncate_range(inode, inode->i_size, (loff_t)-1);
-}
-
 static int shmem_notify_change(struct dentry *dentry, struct iattr *attr)
 {
 	struct inode *inode = dentry->d_inode;
@@ -775,6 +771,8 @@ static int shmem_notify_change(struct de
 	int error;
 
 	if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
+		loff_t newsize = attr->ia_size;
+
 		if (attr->ia_size < inode->i_size) {
 			/*
 			 * If truncating down to a partial page, then
@@ -783,9 +781,9 @@ static int shmem_notify_change(struct de
 			 * truncate_partial_page cannnot miss it were
 			 * it assigned to swap.
 			 */
-			if (attr->ia_size & (PAGE_CACHE_SIZE-1)) {
+			if (newsize & (PAGE_CACHE_SIZE-1)) {
 				(void) shmem_getpage(inode,
-					attr->ia_size>>PAGE_CACHE_SHIFT,
+					newsize >> PAGE_CACHE_SHIFT,
 						&page, SGP_READ, NULL);
 				if (page)
 					unlock_page(page);
@@ -797,13 +795,19 @@ static int shmem_notify_change(struct de
 			 * if it's being fully truncated to zero-length: the
 			 * nrpages check is efficient enough in that case.
 			 */
-			if (attr->ia_size) {
+			if (newsize) {
 				struct shmem_inode_info *info = SHMEM_I(inode);
 				spin_lock(&info->lock);
 				info->flags &= ~SHMEM_PAGEIN;
 				spin_unlock(&info->lock);
 			}
 		}
+
+		error = simple_setsize(inode, newsize);
+		if (error)
+			return error;
+		shmem_truncate_range(inode, newsize, (loff_t)-1);
+		attr->ia_valid &= ~ATTR_SIZE;
 	}
 
 	error = inode_change_ok(inode, attr);
@@ -822,11 +826,11 @@ static void shmem_delete_inode(struct in
 {
 	struct shmem_inode_info *info = SHMEM_I(inode);
 
-	if (inode->i_op->truncate == shmem_truncate) {
+	if (inode->i_mapping->a_ops == &shmem_aops) {
 		truncate_inode_pages(inode->i_mapping, 0);
 		shmem_unacct_size(info->flags, inode->i_size);
 		inode->i_size = 0;
-		shmem_truncate(inode);
+		shmem_truncate_range(inode, 0, (loff_t)-1);
 		if (!list_empty(&info->swaplist)) {
 			mutex_lock(&shmem_swaplist_mutex);
 			list_del_init(&info->swaplist);
@@ -2018,7 +2022,6 @@ static const struct inode_operations shm
 };
 
 static const struct inode_operations shmem_symlink_inode_operations = {
-	.truncate	= shmem_truncate,
 	.readlink	= generic_readlink,
 	.follow_link	= shmem_follow_link,
 	.put_link	= shmem_put_link,
@@ -2438,7 +2441,7 @@ static const struct file_operations shme
 };
 
 static const struct inode_operations shmem_inode_operations = {
-	.truncate	= shmem_truncate,
+	.new_truncate	= 1,
 	.setattr	= shmem_notify_change,
 	.truncate_range	= shmem_truncate_range,
 #ifdef CONFIG_TMPFS_POSIX_ACL



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 5/5] ext2: convert to use the new truncate convention.
  2009-07-10  7:30 [patch 0/5] new truncate sequence patchset npiggin
                   ` (3 preceding siblings ...)
  2009-07-10  7:30 ` [patch 4/5] tmpfs: convert to use the new truncate convention npiggin
@ 2009-07-10  7:30 ` npiggin
  2009-09-01 18:29   ` Jan Kara
  2009-07-10  9:33   ` Nick Piggin
  2009-07-10 14:31 ` [patch 0/5] new truncate sequence patchset Christoph Hellwig
  6 siblings, 1 reply; 28+ messages in thread
From: npiggin @ 2009-07-10  7:30 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: hch, viro, jack, linux-ext4

[-- Attachment #1: ext2-new-truncate.patch --]
[-- Type: text/plain, Size: 8218 bytes --]

I have some questions, marked with XXX.

Cc: linux-ext4@vger.kernel.org
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 fs/ext2/ext2.h  |    1 
 fs/ext2/file.c  |    2 
 fs/ext2/inode.c |  138 +++++++++++++++++++++++++++++++++++++++++++++-----------
 3 files changed, 113 insertions(+), 28 deletions(-)

Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c
+++ linux-2.6/fs/ext2/inode.c
@@ -53,6 +53,8 @@ static inline int ext2_inode_is_fast_sym
 		inode->i_blocks - ea_blocks == 0);
 }
 
+static void ext2_truncate_blocks(struct inode *inode, loff_t offset);
+
 /*
  * Called at the last iput() if i_nlink is zero.
  */
@@ -66,9 +68,10 @@ void ext2_delete_inode (struct inode * i
 	mark_inode_dirty(inode);
 	ext2_write_inode(inode, inode_needs_sync(inode));
 
+	/* XXX: where is truncate_inode_pages? */
 	inode->i_size = 0;
 	if (inode->i_blocks)
-		ext2_truncate (inode);
+		ext2_truncate_blocks(inode, 0);
 	ext2_free_inode (inode);
 
 	return;
@@ -761,8 +764,33 @@ ext2_write_begin(struct file *file, stru
 		loff_t pos, unsigned len, unsigned flags,
 		struct page **pagep, void **fsdata)
 {
+	int ret;
+
 	*pagep = NULL;
-	return __ext2_write_begin(file, mapping, pos, len, flags, pagep,fsdata);
+	ret = __ext2_write_begin(file, mapping, pos, len, flags, pagep,fsdata);
+	if (ret < 0) {
+		struct inode *inode = mapping->host;
+		loff_t isize = inode->i_size;
+		if (pos + len > isize)
+			ext2_truncate_blocks(inode, isize);
+	}
+	return ret;
+}
+
+static int ext2_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
+{
+	int ret;
+
+	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
+	if (ret < len) {
+		struct inode *inode = mapping->host;
+		loff_t isize = inode->i_size;
+		if (pos + len > isize)
+			ext2_truncate_blocks(inode, isize);
+	}
+	return ret;
 }
 
 static int
@@ -770,13 +798,22 @@ ext2_nobh_write_begin(struct file *file,
 		loff_t pos, unsigned len, unsigned flags,
 		struct page **pagep, void **fsdata)
 {
+	int ret;
+
 	/*
 	 * Dir-in-pagecache still uses ext2_write_begin. Would have to rework
 	 * directory handling code to pass around offsets rather than struct
 	 * pages in order to make this work easily.
 	 */
-	return nobh_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+	ret = nobh_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
 							ext2_get_block);
+	if (ret < 0) {
+		struct inode *inode = mapping->host;
+		loff_t isize = inode->i_size;
+		if (pos + len > isize)
+			ext2_truncate_blocks(inode, isize);
+	}
+	return ret;
 }
 
 static int ext2_nobh_writepage(struct page *page,
@@ -796,9 +833,15 @@ ext2_direct_IO(int rw, struct kiocb *ioc
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
+	ssize_t ret;
 
-	return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
+	ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
 				offset, nr_segs, ext2_get_block, NULL);
+	if (ret < 0 && (rw & WRITE)) {
+		loff_t isize = inode->i_size;
+		ext2_truncate_blocks(inode, isize);
+	}
+	return ret;
 }
 
 static int
@@ -813,7 +856,7 @@ const struct address_space_operations ex
 	.writepage		= ext2_writepage,
 	.sync_page		= block_sync_page,
 	.write_begin		= ext2_write_begin,
-	.write_end		= generic_write_end,
+	.write_end		= ext2_write_end,
 	.bmap			= ext2_bmap,
 	.direct_IO		= ext2_direct_IO,
 	.writepages		= ext2_writepages,
@@ -1020,7 +1063,7 @@ static void ext2_free_branches(struct in
 		ext2_free_data(inode, p, q);
 }
 
-void ext2_truncate(struct inode *inode)
+static void __ext2_truncate_blocks(struct inode *inode, loff_t offset)
 {
 	__le32 *i_data = EXT2_I(inode)->i_data;
 	struct ext2_inode_info *ei = EXT2_I(inode);
@@ -1032,27 +1075,8 @@ void ext2_truncate(struct inode *inode)
 	int n;
 	long iblock;
 	unsigned blocksize;
-
-	if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
-	    S_ISLNK(inode->i_mode)))
-		return;
-	if (ext2_inode_is_fast_symlink(inode))
-		return;
-	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
-		return;
-
 	blocksize = inode->i_sb->s_blocksize;
-	iblock = (inode->i_size + blocksize-1)
-					>> EXT2_BLOCK_SIZE_BITS(inode->i_sb);
-
-	if (mapping_is_xip(inode->i_mapping))
-		xip_truncate_page(inode->i_mapping, inode->i_size);
-	else if (test_opt(inode->i_sb, NOBH))
-		nobh_truncate_page(inode->i_mapping,
-				inode->i_size, ext2_get_block);
-	else
-		block_truncate_page(inode->i_mapping,
-				inode->i_size, ext2_get_block);
+	iblock = (offset + blocksize-1) >> EXT2_BLOCK_SIZE_BITS(inode->i_sb);
 
 	n = ext2_block_to_path(inode, iblock, offsets, NULL);
 	if (n == 0)
@@ -1120,6 +1144,59 @@ do_indirects:
 	ext2_discard_reservation(inode);
 
 	mutex_unlock(&ei->truncate_mutex);
+}
+
+static void ext2_truncate_blocks(struct inode *inode, loff_t offset)
+{
+	/*
+	 * XXX: it seems like a bug here that we don't allow
+	 * IS_APPEND inode to have blocks-past-i_size trimmed off.
+	 * review and fix this.
+	 */
+	if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
+	    S_ISLNK(inode->i_mode)))
+		return -EINVAL;
+	if (ext2_inode_is_fast_symlink(inode))
+		return -EINVAL;
+	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+		return -EPERM;
+	__ext2_truncate_blocks(inode, offset);
+}
+
+int ext2_setsize(struct inode *inode, loff_t newsize)
+{
+	loff_t oldsize;
+	int error;
+
+	error = inode_newsize_ok(inode, newsize);
+	if (error)
+		return error;
+
+	if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
+	    S_ISLNK(inode->i_mode)))
+		return -EINVAL;
+	if (ext2_inode_is_fast_symlink(inode))
+		return -EINVAL;
+	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+		return -EPERM;
+
+	if (mapping_is_xip(inode->i_mapping))
+		error = xip_truncate_page(inode->i_mapping, newsize);
+	else if (test_opt(inode->i_sb, NOBH))
+		error = nobh_truncate_page(inode->i_mapping,
+				newsize, ext2_get_block);
+	else
+		error = block_truncate_page(inode->i_mapping,
+				newsize, ext2_get_block);
+	if (error)
+		return error;
+
+	oldsize = inode->i_size;
+	i_size_write(inode, newsize);
+	truncate_pagecache(inode, oldsize, newsize);
+
+	__ext2_truncate_blocks(inode, newsize);
+
 	inode->i_mtime = inode->i_ctime = CURRENT_TIME_SEC;
 	if (inode_needs_sync(inode)) {
 		sync_mapping_buffers(inode->i_mapping);
@@ -1127,6 +1204,8 @@ do_indirects:
 	} else {
 		mark_inode_dirty(inode);
 	}
+
+	return 0;
 }
 
 static struct ext2_inode *ext2_get_inode(struct super_block *sb, ino_t ino,
@@ -1459,6 +1538,13 @@ int ext2_setattr(struct dentry *dentry,
 		if (error)
 			return error;
 	}
+	if (iattr->ia_valid & ATTR_SIZE) {
+		error = ext2_setsize(inode, iattr->ia_size);
+		if (error)
+			return error;
+		iattr->ia_valid &= ~ATTR_SIZE;
+		/* strip ATTR_SIZE for inode_setattr */
+	}
 	error = inode_setattr(inode, iattr);
 	if (!error && (iattr->ia_valid & ATTR_MODE))
 		error = ext2_acl_chmod(inode);
Index: linux-2.6/fs/ext2/ext2.h
===================================================================
--- linux-2.6.orig/fs/ext2/ext2.h
+++ linux-2.6/fs/ext2/ext2.h
@@ -122,7 +122,6 @@ extern int ext2_write_inode (struct inod
 extern void ext2_delete_inode (struct inode *);
 extern int ext2_sync_inode (struct inode *);
 extern int ext2_get_block(struct inode *, sector_t, struct buffer_head *, int);
-extern void ext2_truncate (struct inode *);
 extern int ext2_setattr (struct dentry *, struct iattr *);
 extern void ext2_set_inode_flags(struct inode *inode);
 extern void ext2_get_inode_flags(struct ext2_inode_info *);
Index: linux-2.6/fs/ext2/file.c
===================================================================
--- linux-2.6.orig/fs/ext2/file.c
+++ linux-2.6/fs/ext2/file.c
@@ -77,13 +77,13 @@ const struct file_operations ext2_xip_fi
 #endif
 
 const struct inode_operations ext2_file_inode_operations = {
-	.truncate	= ext2_truncate,
 #ifdef CONFIG_EXT2_FS_XATTR
 	.setxattr	= generic_setxattr,
 	.getxattr	= generic_getxattr,
 	.listxattr	= ext2_listxattr,
 	.removexattr	= generic_removexattr,
 #endif
+	.new_truncate	= 1,
 	.setattr	= ext2_setattr,
 	.permission	= ext2_permission,
 	.fiemap		= ext2_fiemap,



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 1/3] fs: buffer_head writepage no invalidate
  2009-07-10  7:30 [patch 0/5] new truncate sequence patchset npiggin
@ 2009-07-10  9:33   ` Nick Piggin
  2009-07-10  7:30   ` npiggin
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 28+ messages in thread
From: Nick Piggin @ 2009-07-10  9:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: hch, viro, jack, linux-mm

After the previous patchset, this is my progress on the page_mkwrite
thing... These patches are RFC only and have some bugs.
--
invalidate should not be required in the writeout path. The truncate
sequence will first reduce i_size, then clean and discard any existing
pagecache (and no new dirty pagecache can be added because i_size was
reduced and i_mutex is being held), then filesystem data structures
are updated.

Filesystem needs to be able to handle writeout at any point before
the last step, and once the 2nd step completes, there should be no
unfreeable dirty buffers anyway (truncate performs the do_invalidatepage).

Having filesystem changes depend on reading i_size without holding
i_mutex is confusing at least. There is still a case in writepage
paths in buffer.c uses i_size (testing which block to write out), but
this is a small improvement.
---
 fs/buffer.c |   20 ++------------------
 1 file changed, 2 insertions(+), 18 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2663,18 +2663,8 @@ int nobh_writepage(struct page *page, ge
 	/* Is the page fully outside i_size? (truncate in progress) */
 	offset = i_size & (PAGE_CACHE_SIZE-1);
 	if (page->index >= end_index+1 || !offset) {
-		/*
-		 * The page may have dirty, unmapped buffers.  For example,
-		 * they may have been added in ext3_writepage().  Make them
-		 * freeable here, so the page does not leak.
-		 */
-#if 0
-		/* Not really sure about this  - do we need this ? */
-		if (page->mapping->a_ops->invalidatepage)
-			page->mapping->a_ops->invalidatepage(page, offset);
-#endif
 		unlock_page(page);
-		return 0; /* don't care */
+		return 0;
 	}
 
 	/*
@@ -2867,14 +2857,8 @@ int block_write_full_page_endio(struct p
 	/* Is the page fully outside i_size? (truncate in progress) */
 	offset = i_size & (PAGE_CACHE_SIZE-1);
 	if (page->index >= end_index+1 || !offset) {
-		/*
-		 * The page may have dirty, unmapped buffers.  For example,
-		 * they may have been added in ext3_writepage().  Make them
-		 * freeable here, so the page does not leak.
-		 */
-		do_invalidatepage(page, 0);
 		unlock_page(page);
-		return 0; /* don't care */
+		return 0;
 	}
 
 	/*

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 1/3] fs: buffer_head writepage no invalidate
@ 2009-07-10  9:33   ` Nick Piggin
  0 siblings, 0 replies; 28+ messages in thread
From: Nick Piggin @ 2009-07-10  9:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: hch, viro, jack, linux-mm

After the previous patchset, this is my progress on the page_mkwrite
thing... These patches are RFC only and have some bugs.
--
invalidate should not be required in the writeout path. The truncate
sequence will first reduce i_size, then clean and discard any existing
pagecache (and no new dirty pagecache can be added because i_size was
reduced and i_mutex is being held), then filesystem data structures
are updated.

Filesystem needs to be able to handle writeout at any point before
the last step, and once the 2nd step completes, there should be no
unfreeable dirty buffers anyway (truncate performs the do_invalidatepage).

Having filesystem changes depend on reading i_size without holding
i_mutex is confusing at least. There is still a case in writepage
paths in buffer.c uses i_size (testing which block to write out), but
this is a small improvement.
---
 fs/buffer.c |   20 ++------------------
 1 file changed, 2 insertions(+), 18 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2663,18 +2663,8 @@ int nobh_writepage(struct page *page, ge
 	/* Is the page fully outside i_size? (truncate in progress) */
 	offset = i_size & (PAGE_CACHE_SIZE-1);
 	if (page->index >= end_index+1 || !offset) {
-		/*
-		 * The page may have dirty, unmapped buffers.  For example,
-		 * they may have been added in ext3_writepage().  Make them
-		 * freeable here, so the page does not leak.
-		 */
-#if 0
-		/* Not really sure about this  - do we need this ? */
-		if (page->mapping->a_ops->invalidatepage)
-			page->mapping->a_ops->invalidatepage(page, offset);
-#endif
 		unlock_page(page);
-		return 0; /* don't care */
+		return 0;
 	}
 
 	/*
@@ -2867,14 +2857,8 @@ int block_write_full_page_endio(struct p
 	/* Is the page fully outside i_size? (truncate in progress) */
 	offset = i_size & (PAGE_CACHE_SIZE-1);
 	if (page->index >= end_index+1 || !offset) {
-		/*
-		 * The page may have dirty, unmapped buffers.  For example,
-		 * they may have been added in ext3_writepage().  Make them
-		 * freeable here, so the page does not leak.
-		 */
-		do_invalidatepage(page, 0);
 		unlock_page(page);
-		return 0; /* don't care */
+		return 0;
 	}
 
 	/*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 2/3] fs: buffer_head writepage no zero
  2009-07-10  9:33   ` Nick Piggin
@ 2009-07-10  9:34     ` Nick Piggin
  -1 siblings, 0 replies; 28+ messages in thread
From: Nick Piggin @ 2009-07-10  9:34 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: hch, viro, jack, linux-mm


When writing a page to filesystem, buffer.c zeroes out parts of the page past
i_size in an attempt to get zeroes into those blocks on disk, so as to honour
the requirement that an expanding truncate should zero-fill the file.

Unfortunately, this is racy. The reason we can get something other than
zeroes here is via an mmaped write to the block beyond i_size. Zeroing it
out before writepage narrows the window, but it is still possible to store
junk beyond i_size on disk, by storing into the page after writepage zeroes,
but before DMA (or copy) completes. This allows process A to break posix
semantics for process B (or even inadvertently for itsef).

It could also be possible that the filesystem has written data into the
block but not yet expanded the inode size when the system crashes for
some reason. Unless its journal reply / fsck process etc checks for this
condition, it could also cause subsequent breakage in semantics.

---
 fs/buffer.c      |   94 +++++++++++++++++++++++++------------------------------
 fs/ext2/inode.c  |   30 ++++++++++++++++-
 fs/mpage.c       |   13 +------
 mm/filemap_xip.c |   15 ++++----
 mm/truncate.c    |    1 
 5 files changed, 82 insertions(+), 71 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2656,26 +2656,14 @@ int nobh_writepage(struct page *page, ge
 	unsigned offset;
 	int ret;
 
-	/* Is the page fully inside i_size? */
-	if (page->index < end_index)
-		goto out;
-
 	/* Is the page fully outside i_size? (truncate in progress) */
 	offset = i_size & (PAGE_CACHE_SIZE-1);
-	if (page->index >= end_index+1 || !offset) {
+	if (page->index >= end_index &&
+			(page->index >= end_index+1 || !offset)) {
 		unlock_page(page);
 		return 0;
 	}
 
-	/*
-	 * The page straddles i_size.  It must be zeroed out on each and every
-	 * writepage invocation because it may be mmapped.  "A file is mapped
-	 * in multiples of the page size.  For a file that is not a multiple of
-	 * the  page size, the remaining memory is zeroed when mapped, and
-	 * writes to that region are not written out to the file."
-	 */
-	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
-out:
 	ret = mpage_writepage(page, get_block, wbc);
 	if (ret == -EAGAIN)
 		ret = __block_write_full_page(inode, page, get_block, wbc,
@@ -2695,22 +2683,23 @@ int nobh_truncate_page(struct address_sp
 	struct inode *inode = mapping->host;
 	struct page *page;
 	struct buffer_head map_bh;
-	int err;
+	int err = 0;
 
 	blocksize = 1 << inode->i_blkbits;
 	length = offset & (blocksize - 1);
 
 	/* Block boundary? Nothing to do */
 	if (!length)
-		return 0;
+		goto out;
 
 	length = blocksize - length;
 	iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
 
 	page = grab_cache_page(mapping, index);
-	err = -ENOMEM;
-	if (!page)
+	if (!page) {
+		err = -ENOMEM;
 		goto out;
+	}
 
 	if (page_has_buffers(page)) {
 has_buffers:
@@ -2752,7 +2741,6 @@ has_buffers:
 	}
 	zero_user(page, offset, length);
 	set_page_dirty(page);
-	err = 0;
 
 unlock:
 	unlock_page(page);
@@ -2762,8 +2750,8 @@ out:
 }
 EXPORT_SYMBOL(nobh_truncate_page);
 
-int block_truncate_page(struct address_space *mapping,
-			loff_t from, get_block_t *get_block)
+int __block_truncate_page(struct address_space *mapping,
+			loff_t from, loff_t to, get_block_t *get_block)
 {
 	pgoff_t index = from >> PAGE_CACHE_SHIFT;
 	unsigned offset = from & (PAGE_CACHE_SIZE-1);
@@ -2773,22 +2761,22 @@ int block_truncate_page(struct address_s
 	struct inode *inode = mapping->host;
 	struct page *page;
 	struct buffer_head *bh;
-	int err;
+	int err = 0;
 
-	blocksize = 1 << inode->i_blkbits;
-	length = offset & (blocksize - 1);
+	/* Page boundary? Nothing to do */
+	if (!offset)
+		goto out;
 
-	/* Block boundary? Nothing to do */
-	if (!length)
-		return 0;
+	blocksize = 1 << inode->i_blkbits;
 
-	length = blocksize - length;
+	length = (unsigned)min_t(loff_t, to - from, PAGE_CACHE_SIZE - offset);
 	iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
 	
 	page = grab_cache_page(mapping, index);
-	err = -ENOMEM;
-	if (!page)
+	if (!page) {
+		err = -ENOMEM;
 		goto out;
+	}
 
 	if (!page_has_buffers(page))
 		create_empty_buffers(page, blocksize, 0);
@@ -2802,15 +2790,20 @@ int block_truncate_page(struct address_s
 		pos += blocksize;
 	}
 
-	err = 0;
 	if (!buffer_mapped(bh)) {
 		WARN_ON(bh->b_size != blocksize);
 		err = get_block(inode, iblock, bh, 0);
 		if (err)
 			goto unlock;
-		/* unmapped? It's a hole - nothing to do */
-		if (!buffer_mapped(bh))
+		/*
+		 * unmapped? It's a hole - must zero out partial
+		 * in the case of an extending truncate where mmap has
+		 * previously written past i_size of the page
+		 */
+		if (!buffer_mapped(bh)) {
+			zero_user(page, offset, length);
 			goto unlock;
+		}
 	}
 
 	/* Ok, it's mapped. Make sure it's up-to-date */
@@ -2818,17 +2811,17 @@ int block_truncate_page(struct address_s
 		set_buffer_uptodate(bh);
 
 	if (!buffer_uptodate(bh) && !buffer_delay(bh) && !buffer_unwritten(bh)) {
-		err = -EIO;
 		ll_rw_block(READ, 1, &bh);
 		wait_on_buffer(bh);
 		/* Uhhuh. Read error. Complain and punt. */
-		if (!buffer_uptodate(bh))
+		if (!buffer_uptodate(bh)) {
+			err = -EIO;
 			goto unlock;
+		}
 	}
 
 	zero_user(page, offset, length);
 	mark_buffer_dirty(bh);
-	err = 0;
 
 unlock:
 	unlock_page(page);
@@ -2836,6 +2829,19 @@ unlock:
 out:
 	return err;
 }
+EXPORT_SYMBOL(__block_truncate_page);
+
+int block_truncate_page(struct address_space *mapping,
+			loff_t from, get_block_t *get_block)
+{
+	struct inode *inode = mapping->host;
+	int err = 0;
+
+	if (from > inode->i_size)
+		err = __block_truncate_page(mapping, inode->i_size, from, get_block);
+
+	return err;
+}
 
 /*
  * The generic ->writepage function for buffer-backed address_spaces
@@ -2849,26 +2855,14 @@ int block_write_full_page_endio(struct p
 	const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
 	unsigned offset;
 
-	/* Is the page fully inside i_size? */
-	if (page->index < end_index)
-		return __block_write_full_page(inode, page, get_block, wbc,
-					       handler);
-
 	/* Is the page fully outside i_size? (truncate in progress) */
 	offset = i_size & (PAGE_CACHE_SIZE-1);
-	if (page->index >= end_index+1 || !offset) {
+	if (page->index >= end_index &&
+			(page->index >= end_index+1 || !offset)) {
 		unlock_page(page);
 		return 0;
 	}
 
-	/*
-	 * The page straddles i_size.  It must be zeroed out on each and every
-	 * writepage invokation because it may be mmapped.  "A file is mapped
-	 * in multiples of the page size.  For a file that is not a multiple of
-	 * the  page size, the remaining memory is zeroed when mapped, and
-	 * writes to that region are not written out to the file."
-	 */
-	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
 	return __block_write_full_page(inode, page, get_block, wbc, handler);
 }
 
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -440,22 +440,23 @@ EXPORT_SYMBOL_GPL(xip_file_write);
 int
 xip_truncate_page(struct address_space *mapping, loff_t from)
 {
+	struct inode *inode = mapping->host;
 	pgoff_t index = from >> PAGE_CACHE_SHIFT;
 	unsigned offset = from & (PAGE_CACHE_SIZE-1);
 	unsigned blocksize;
 	unsigned length;
 	void *xip_mem;
 	unsigned long xip_pfn;
-	int err;
+	int err = 0;
 
 	BUG_ON(!mapping->a_ops->get_xip_mem);
 
-	blocksize = 1 << mapping->host->i_blkbits;
+	blocksize = 1 << inode->i_blkbits;
 	length = offset & (blocksize - 1);
 
 	/* Block boundary? Nothing to do */
 	if (!length)
-		return 0;
+		goto out;
 
 	length = blocksize - length;
 
@@ -464,11 +465,11 @@ xip_truncate_page(struct address_space *
 	if (unlikely(err)) {
 		if (err == -ENODATA)
 			/* Hole? No need to truncate */
-			return 0;
-		else
-			return err;
+			err = 0;
+		goto out;
 	}
 	memset(xip_mem + offset, 0, length);
-	return 0;
+out:
+	return err;
 }
 EXPORT_SYMBOL_GPL(xip_truncate_page);
Index: linux-2.6/fs/mpage.c
===================================================================
--- linux-2.6.orig/fs/mpage.c
+++ linux-2.6/fs/mpage.c
@@ -559,19 +559,10 @@ static int __mpage_writepage(struct page
 page_is_mapped:
 	end_index = i_size >> PAGE_CACHE_SHIFT;
 	if (page->index >= end_index) {
-		/*
-		 * The page straddles i_size.  It must be zeroed out on each
-		 * and every writepage invokation because it may be mmapped.
-		 * "A file is mapped in multiples of the page size.  For a file
-		 * that is not a multiple of the page size, the remaining memory
-		 * is zeroed when mapped, and writes to that region are not
-		 * written out to the file."
-		 */
 		unsigned offset = i_size & (PAGE_CACHE_SIZE - 1);
 
-		if (page->index > end_index || !offset)
-			goto confused;
-		zero_user_segment(page, offset, PAGE_CACHE_SIZE);
+		if (page->index >= end_index+1 || !offset)
+			goto confused; /* page fully outside i_size */
 	}
 
 	/*
Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c
+++ linux-2.6/fs/ext2/inode.c
@@ -777,14 +777,40 @@ ext2_write_begin(struct file *file, stru
 	return ret;
 }
 
+int __block_truncate_page(struct address_space *mapping,
+			loff_t from, loff_t to, get_block_t *get_block);
 static int ext2_write_end(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned copied,
 			struct page *page, void *fsdata)
 {
+	struct inode *inode = mapping->host;
 	int ret;
 
-	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
-	if (ret < len) {
+	ret = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+	unlock_page(page);
+	page_cache_release(page);
+        if (pos+copied > inode->i_size) {
+		int err;
+                if (pos > inode->i_size) {
+                        /* expanding a hole */
+			err = __block_truncate_page(mapping, inode->i_size,
+						pos, ext2_get_block);
+			if (err) {
+				ret = err;
+				goto out;
+			}
+			err = __block_truncate_page(mapping, pos+copied,
+						LLONG_MAX, ext2_get_block);
+			if (err) {
+				ret = err;
+				goto out;
+			}
+                }
+                i_size_write(inode, pos+copied);
+                mark_inode_dirty(inode);
+        }
+out:
+	if (ret < 0 || ret < len) {
 		struct inode *inode = mapping->host;
 		loff_t isize = inode->i_size;
 		if (pos + len > isize)
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -49,7 +49,6 @@ void do_invalidatepage(struct page *page
 
 static inline void truncate_partial_page(struct page *page, unsigned partial)
 {
-	zero_user_segment(page, partial, PAGE_CACHE_SIZE);
 	if (page_has_private(page))
 		do_invalidatepage(page, partial);
 }

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 2/3] fs: buffer_head writepage no zero
@ 2009-07-10  9:34     ` Nick Piggin
  0 siblings, 0 replies; 28+ messages in thread
From: Nick Piggin @ 2009-07-10  9:34 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: hch, viro, jack, linux-mm


When writing a page to filesystem, buffer.c zeroes out parts of the page past
i_size in an attempt to get zeroes into those blocks on disk, so as to honour
the requirement that an expanding truncate should zero-fill the file.

Unfortunately, this is racy. The reason we can get something other than
zeroes here is via an mmaped write to the block beyond i_size. Zeroing it
out before writepage narrows the window, but it is still possible to store
junk beyond i_size on disk, by storing into the page after writepage zeroes,
but before DMA (or copy) completes. This allows process A to break posix
semantics for process B (or even inadvertently for itsef).

It could also be possible that the filesystem has written data into the
block but not yet expanded the inode size when the system crashes for
some reason. Unless its journal reply / fsck process etc checks for this
condition, it could also cause subsequent breakage in semantics.

---
 fs/buffer.c      |   94 +++++++++++++++++++++++++------------------------------
 fs/ext2/inode.c  |   30 ++++++++++++++++-
 fs/mpage.c       |   13 +------
 mm/filemap_xip.c |   15 ++++----
 mm/truncate.c    |    1 
 5 files changed, 82 insertions(+), 71 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2656,26 +2656,14 @@ int nobh_writepage(struct page *page, ge
 	unsigned offset;
 	int ret;
 
-	/* Is the page fully inside i_size? */
-	if (page->index < end_index)
-		goto out;
-
 	/* Is the page fully outside i_size? (truncate in progress) */
 	offset = i_size & (PAGE_CACHE_SIZE-1);
-	if (page->index >= end_index+1 || !offset) {
+	if (page->index >= end_index &&
+			(page->index >= end_index+1 || !offset)) {
 		unlock_page(page);
 		return 0;
 	}
 
-	/*
-	 * The page straddles i_size.  It must be zeroed out on each and every
-	 * writepage invocation because it may be mmapped.  "A file is mapped
-	 * in multiples of the page size.  For a file that is not a multiple of
-	 * the  page size, the remaining memory is zeroed when mapped, and
-	 * writes to that region are not written out to the file."
-	 */
-	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
-out:
 	ret = mpage_writepage(page, get_block, wbc);
 	if (ret == -EAGAIN)
 		ret = __block_write_full_page(inode, page, get_block, wbc,
@@ -2695,22 +2683,23 @@ int nobh_truncate_page(struct address_sp
 	struct inode *inode = mapping->host;
 	struct page *page;
 	struct buffer_head map_bh;
-	int err;
+	int err = 0;
 
 	blocksize = 1 << inode->i_blkbits;
 	length = offset & (blocksize - 1);
 
 	/* Block boundary? Nothing to do */
 	if (!length)
-		return 0;
+		goto out;
 
 	length = blocksize - length;
 	iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
 
 	page = grab_cache_page(mapping, index);
-	err = -ENOMEM;
-	if (!page)
+	if (!page) {
+		err = -ENOMEM;
 		goto out;
+	}
 
 	if (page_has_buffers(page)) {
 has_buffers:
@@ -2752,7 +2741,6 @@ has_buffers:
 	}
 	zero_user(page, offset, length);
 	set_page_dirty(page);
-	err = 0;
 
 unlock:
 	unlock_page(page);
@@ -2762,8 +2750,8 @@ out:
 }
 EXPORT_SYMBOL(nobh_truncate_page);
 
-int block_truncate_page(struct address_space *mapping,
-			loff_t from, get_block_t *get_block)
+int __block_truncate_page(struct address_space *mapping,
+			loff_t from, loff_t to, get_block_t *get_block)
 {
 	pgoff_t index = from >> PAGE_CACHE_SHIFT;
 	unsigned offset = from & (PAGE_CACHE_SIZE-1);
@@ -2773,22 +2761,22 @@ int block_truncate_page(struct address_s
 	struct inode *inode = mapping->host;
 	struct page *page;
 	struct buffer_head *bh;
-	int err;
+	int err = 0;
 
-	blocksize = 1 << inode->i_blkbits;
-	length = offset & (blocksize - 1);
+	/* Page boundary? Nothing to do */
+	if (!offset)
+		goto out;
 
-	/* Block boundary? Nothing to do */
-	if (!length)
-		return 0;
+	blocksize = 1 << inode->i_blkbits;
 
-	length = blocksize - length;
+	length = (unsigned)min_t(loff_t, to - from, PAGE_CACHE_SIZE - offset);
 	iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
 	
 	page = grab_cache_page(mapping, index);
-	err = -ENOMEM;
-	if (!page)
+	if (!page) {
+		err = -ENOMEM;
 		goto out;
+	}
 
 	if (!page_has_buffers(page))
 		create_empty_buffers(page, blocksize, 0);
@@ -2802,15 +2790,20 @@ int block_truncate_page(struct address_s
 		pos += blocksize;
 	}
 
-	err = 0;
 	if (!buffer_mapped(bh)) {
 		WARN_ON(bh->b_size != blocksize);
 		err = get_block(inode, iblock, bh, 0);
 		if (err)
 			goto unlock;
-		/* unmapped? It's a hole - nothing to do */
-		if (!buffer_mapped(bh))
+		/*
+		 * unmapped? It's a hole - must zero out partial
+		 * in the case of an extending truncate where mmap has
+		 * previously written past i_size of the page
+		 */
+		if (!buffer_mapped(bh)) {
+			zero_user(page, offset, length);
 			goto unlock;
+		}
 	}
 
 	/* Ok, it's mapped. Make sure it's up-to-date */
@@ -2818,17 +2811,17 @@ int block_truncate_page(struct address_s
 		set_buffer_uptodate(bh);
 
 	if (!buffer_uptodate(bh) && !buffer_delay(bh) && !buffer_unwritten(bh)) {
-		err = -EIO;
 		ll_rw_block(READ, 1, &bh);
 		wait_on_buffer(bh);
 		/* Uhhuh. Read error. Complain and punt. */
-		if (!buffer_uptodate(bh))
+		if (!buffer_uptodate(bh)) {
+			err = -EIO;
 			goto unlock;
+		}
 	}
 
 	zero_user(page, offset, length);
 	mark_buffer_dirty(bh);
-	err = 0;
 
 unlock:
 	unlock_page(page);
@@ -2836,6 +2829,19 @@ unlock:
 out:
 	return err;
 }
+EXPORT_SYMBOL(__block_truncate_page);
+
+int block_truncate_page(struct address_space *mapping,
+			loff_t from, get_block_t *get_block)
+{
+	struct inode *inode = mapping->host;
+	int err = 0;
+
+	if (from > inode->i_size)
+		err = __block_truncate_page(mapping, inode->i_size, from, get_block);
+
+	return err;
+}
 
 /*
  * The generic ->writepage function for buffer-backed address_spaces
@@ -2849,26 +2855,14 @@ int block_write_full_page_endio(struct p
 	const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
 	unsigned offset;
 
-	/* Is the page fully inside i_size? */
-	if (page->index < end_index)
-		return __block_write_full_page(inode, page, get_block, wbc,
-					       handler);
-
 	/* Is the page fully outside i_size? (truncate in progress) */
 	offset = i_size & (PAGE_CACHE_SIZE-1);
-	if (page->index >= end_index+1 || !offset) {
+	if (page->index >= end_index &&
+			(page->index >= end_index+1 || !offset)) {
 		unlock_page(page);
 		return 0;
 	}
 
-	/*
-	 * The page straddles i_size.  It must be zeroed out on each and every
-	 * writepage invokation because it may be mmapped.  "A file is mapped
-	 * in multiples of the page size.  For a file that is not a multiple of
-	 * the  page size, the remaining memory is zeroed when mapped, and
-	 * writes to that region are not written out to the file."
-	 */
-	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
 	return __block_write_full_page(inode, page, get_block, wbc, handler);
 }
 
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -440,22 +440,23 @@ EXPORT_SYMBOL_GPL(xip_file_write);
 int
 xip_truncate_page(struct address_space *mapping, loff_t from)
 {
+	struct inode *inode = mapping->host;
 	pgoff_t index = from >> PAGE_CACHE_SHIFT;
 	unsigned offset = from & (PAGE_CACHE_SIZE-1);
 	unsigned blocksize;
 	unsigned length;
 	void *xip_mem;
 	unsigned long xip_pfn;
-	int err;
+	int err = 0;
 
 	BUG_ON(!mapping->a_ops->get_xip_mem);
 
-	blocksize = 1 << mapping->host->i_blkbits;
+	blocksize = 1 << inode->i_blkbits;
 	length = offset & (blocksize - 1);
 
 	/* Block boundary? Nothing to do */
 	if (!length)
-		return 0;
+		goto out;
 
 	length = blocksize - length;
 
@@ -464,11 +465,11 @@ xip_truncate_page(struct address_space *
 	if (unlikely(err)) {
 		if (err == -ENODATA)
 			/* Hole? No need to truncate */
-			return 0;
-		else
-			return err;
+			err = 0;
+		goto out;
 	}
 	memset(xip_mem + offset, 0, length);
-	return 0;
+out:
+	return err;
 }
 EXPORT_SYMBOL_GPL(xip_truncate_page);
Index: linux-2.6/fs/mpage.c
===================================================================
--- linux-2.6.orig/fs/mpage.c
+++ linux-2.6/fs/mpage.c
@@ -559,19 +559,10 @@ static int __mpage_writepage(struct page
 page_is_mapped:
 	end_index = i_size >> PAGE_CACHE_SHIFT;
 	if (page->index >= end_index) {
-		/*
-		 * The page straddles i_size.  It must be zeroed out on each
-		 * and every writepage invokation because it may be mmapped.
-		 * "A file is mapped in multiples of the page size.  For a file
-		 * that is not a multiple of the page size, the remaining memory
-		 * is zeroed when mapped, and writes to that region are not
-		 * written out to the file."
-		 */
 		unsigned offset = i_size & (PAGE_CACHE_SIZE - 1);
 
-		if (page->index > end_index || !offset)
-			goto confused;
-		zero_user_segment(page, offset, PAGE_CACHE_SIZE);
+		if (page->index >= end_index+1 || !offset)
+			goto confused; /* page fully outside i_size */
 	}
 
 	/*
Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c
+++ linux-2.6/fs/ext2/inode.c
@@ -777,14 +777,40 @@ ext2_write_begin(struct file *file, stru
 	return ret;
 }
 
+int __block_truncate_page(struct address_space *mapping,
+			loff_t from, loff_t to, get_block_t *get_block);
 static int ext2_write_end(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned copied,
 			struct page *page, void *fsdata)
 {
+	struct inode *inode = mapping->host;
 	int ret;
 
-	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
-	if (ret < len) {
+	ret = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+	unlock_page(page);
+	page_cache_release(page);
+        if (pos+copied > inode->i_size) {
+		int err;
+                if (pos > inode->i_size) {
+                        /* expanding a hole */
+			err = __block_truncate_page(mapping, inode->i_size,
+						pos, ext2_get_block);
+			if (err) {
+				ret = err;
+				goto out;
+			}
+			err = __block_truncate_page(mapping, pos+copied,
+						LLONG_MAX, ext2_get_block);
+			if (err) {
+				ret = err;
+				goto out;
+			}
+                }
+                i_size_write(inode, pos+copied);
+                mark_inode_dirty(inode);
+        }
+out:
+	if (ret < 0 || ret < len) {
 		struct inode *inode = mapping->host;
 		loff_t isize = inode->i_size;
 		if (pos + len > isize)
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -49,7 +49,6 @@ void do_invalidatepage(struct page *page
 
 static inline void truncate_partial_page(struct page *page, unsigned partial)
 {
-	zero_user_segment(page, partial, PAGE_CACHE_SIZE);
 	if (page_has_private(page))
 		do_invalidatepage(page, partial);
 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 3/3] fs: buffer_head page_lock i_size relax
  2009-07-10  9:33   ` Nick Piggin
@ 2009-07-10  9:35     ` Nick Piggin
  -1 siblings, 0 replies; 28+ messages in thread
From: Nick Piggin @ 2009-07-10  9:35 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: hch, viro, jack, linux-mm


The previous patch allows us to relax the buffer.c requirement that the page
lock be held in order to avoid writepage zeroing out new data beyond isize.
[actually as I said I think it still has a bug due to writepage requiring
i_size_read]

---
 fs/buffer.c |   28 ++++++++--------------------
 mm/shmem.c  |    6 +++---
 2 files changed, 11 insertions(+), 23 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2049,33 +2049,20 @@ int generic_write_end(struct file *file,
 			struct page *page, void *fsdata)
 {
 	struct inode *inode = mapping->host;
-	int i_size_changed = 0;
 
 	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
 
+	unlock_page(page);
+	page_cache_release(page);
+
 	/*
 	 * No need to use i_size_read() here, the i_size
 	 * cannot change under us because we hold i_mutex.
-	 *
-	 * But it's important to update i_size while still holding page lock:
-	 * page writeout could otherwise come in and zero beyond i_size.
 	 */
 	if (pos+copied > inode->i_size) {
 		i_size_write(inode, pos+copied);
-		i_size_changed = 1;
-	}
-
-	unlock_page(page);
-	page_cache_release(page);
-
-	/*
-	 * Don't mark the inode dirty under page lock. First, it unnecessarily
-	 * makes the holding time of page lock longer. Second, it forces lock
-	 * ordering of page lock and transaction start for journaling
-	 * filesystems.
-	 */
-	if (i_size_changed)
 		mark_inode_dirty(inode);
+	}
 
 	return copied;
 }
@@ -2624,14 +2611,15 @@ int nobh_write_end(struct file *file, st
 
 	SetPageUptodate(page);
 	set_page_dirty(page);
+
+	unlock_page(page);
+	page_cache_release(page);
+
 	if (pos+copied > inode->i_size) {
 		i_size_write(inode, pos+copied);
 		mark_inode_dirty(inode);
 	}
 
-	unlock_page(page);
-	page_cache_release(page);
-
 	while (head) {
 		bh = head;
 		head = head->b_this_page;
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -1631,13 +1631,13 @@ shmem_write_end(struct file *file, struc
 {
 	struct inode *inode = mapping->host;
 
-	if (pos + copied > inode->i_size)
-		i_size_write(inode, pos + copied);
-
 	unlock_page(page);
 	set_page_dirty(page);
 	page_cache_release(page);
 
+	if (pos + copied > inode->i_size)
+		i_size_write(inode, pos + copied);
+
 	return copied;
 }
 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 3/3] fs: buffer_head page_lock i_size relax
@ 2009-07-10  9:35     ` Nick Piggin
  0 siblings, 0 replies; 28+ messages in thread
From: Nick Piggin @ 2009-07-10  9:35 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: hch, viro, jack, linux-mm


The previous patch allows us to relax the buffer.c requirement that the page
lock be held in order to avoid writepage zeroing out new data beyond isize.
[actually as I said I think it still has a bug due to writepage requiring
i_size_read]

---
 fs/buffer.c |   28 ++++++++--------------------
 mm/shmem.c  |    6 +++---
 2 files changed, 11 insertions(+), 23 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2049,33 +2049,20 @@ int generic_write_end(struct file *file,
 			struct page *page, void *fsdata)
 {
 	struct inode *inode = mapping->host;
-	int i_size_changed = 0;
 
 	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
 
+	unlock_page(page);
+	page_cache_release(page);
+
 	/*
 	 * No need to use i_size_read() here, the i_size
 	 * cannot change under us because we hold i_mutex.
-	 *
-	 * But it's important to update i_size while still holding page lock:
-	 * page writeout could otherwise come in and zero beyond i_size.
 	 */
 	if (pos+copied > inode->i_size) {
 		i_size_write(inode, pos+copied);
-		i_size_changed = 1;
-	}
-
-	unlock_page(page);
-	page_cache_release(page);
-
-	/*
-	 * Don't mark the inode dirty under page lock. First, it unnecessarily
-	 * makes the holding time of page lock longer. Second, it forces lock
-	 * ordering of page lock and transaction start for journaling
-	 * filesystems.
-	 */
-	if (i_size_changed)
 		mark_inode_dirty(inode);
+	}
 
 	return copied;
 }
@@ -2624,14 +2611,15 @@ int nobh_write_end(struct file *file, st
 
 	SetPageUptodate(page);
 	set_page_dirty(page);
+
+	unlock_page(page);
+	page_cache_release(page);
+
 	if (pos+copied > inode->i_size) {
 		i_size_write(inode, pos+copied);
 		mark_inode_dirty(inode);
 	}
 
-	unlock_page(page);
-	page_cache_release(page);
-
 	while (head) {
 		bh = head;
 		head = head->b_this_page;
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -1631,13 +1631,13 @@ shmem_write_end(struct file *file, struc
 {
 	struct inode *inode = mapping->host;
 
-	if (pos + copied > inode->i_size)
-		i_size_write(inode, pos + copied);
-
 	unlock_page(page);
 	set_page_dirty(page);
 	page_cache_release(page);
 
+	if (pos + copied > inode->i_size)
+		i_size_write(inode, pos + copied);
+
 	return copied;
 }
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 1/3] fs: buffer_head writepage no invalidate
  2009-07-10  9:33   ` Nick Piggin
@ 2009-07-10 11:08     ` Jan Kara
  -1 siblings, 0 replies; 28+ messages in thread
From: Jan Kara @ 2009-07-10 11:08 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, hch, viro, linux-mm

On Fri 10-07-09 11:33:25, Nick Piggin wrote:
> After the previous patchset, this is my progress on the page_mkwrite
> thing... These patches are RFC only and have some bugs.
> --
> invalidate should not be required in the writeout path. The truncate
> sequence will first reduce i_size, then clean and discard any existing
> pagecache (and no new dirty pagecache can be added because i_size was
> reduced and i_mutex is being held), then filesystem data structures
> are updated.
> 
> Filesystem needs to be able to handle writeout at any point before
> the last step, and once the 2nd step completes, there should be no
> unfreeable dirty buffers anyway (truncate performs the do_invalidatepage).
> 
> Having filesystem changes depend on reading i_size without holding
> i_mutex is confusing at least. There is still a case in writepage
> paths in buffer.c uses i_size (testing which block to write out), but
> this is a small improvement.
  This looks good.
Acked-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/buffer.c |   20 ++------------------
>  1 file changed, 2 insertions(+), 18 deletions(-)
> 
> Index: linux-2.6/fs/buffer.c
> ===================================================================
> --- linux-2.6.orig/fs/buffer.c
> +++ linux-2.6/fs/buffer.c
> @@ -2663,18 +2663,8 @@ int nobh_writepage(struct page *page, ge
>  	/* Is the page fully outside i_size? (truncate in progress) */
>  	offset = i_size & (PAGE_CACHE_SIZE-1);
>  	if (page->index >= end_index+1 || !offset) {
> -		/*
> -		 * The page may have dirty, unmapped buffers.  For example,
> -		 * they may have been added in ext3_writepage().  Make them
> -		 * freeable here, so the page does not leak.
> -		 */
> -#if 0
> -		/* Not really sure about this  - do we need this ? */
> -		if (page->mapping->a_ops->invalidatepage)
> -			page->mapping->a_ops->invalidatepage(page, offset);
> -#endif
>  		unlock_page(page);
> -		return 0; /* don't care */
> +		return 0;
>  	}
>  
>  	/*
> @@ -2867,14 +2857,8 @@ int block_write_full_page_endio(struct p
>  	/* Is the page fully outside i_size? (truncate in progress) */
>  	offset = i_size & (PAGE_CACHE_SIZE-1);
>  	if (page->index >= end_index+1 || !offset) {
> -		/*
> -		 * The page may have dirty, unmapped buffers.  For example,
> -		 * they may have been added in ext3_writepage().  Make them
> -		 * freeable here, so the page does not leak.
> -		 */
> -		do_invalidatepage(page, 0);
>  		unlock_page(page);
> -		return 0; /* don't care */
> +		return 0;
>  	}
>  
>  	/*
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 1/3] fs: buffer_head writepage no invalidate
@ 2009-07-10 11:08     ` Jan Kara
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Kara @ 2009-07-10 11:08 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, hch, viro, linux-mm

On Fri 10-07-09 11:33:25, Nick Piggin wrote:
> After the previous patchset, this is my progress on the page_mkwrite
> thing... These patches are RFC only and have some bugs.
> --
> invalidate should not be required in the writeout path. The truncate
> sequence will first reduce i_size, then clean and discard any existing
> pagecache (and no new dirty pagecache can be added because i_size was
> reduced and i_mutex is being held), then filesystem data structures
> are updated.
> 
> Filesystem needs to be able to handle writeout at any point before
> the last step, and once the 2nd step completes, there should be no
> unfreeable dirty buffers anyway (truncate performs the do_invalidatepage).
> 
> Having filesystem changes depend on reading i_size without holding
> i_mutex is confusing at least. There is still a case in writepage
> paths in buffer.c uses i_size (testing which block to write out), but
> this is a small improvement.
  This looks good.
Acked-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/buffer.c |   20 ++------------------
>  1 file changed, 2 insertions(+), 18 deletions(-)
> 
> Index: linux-2.6/fs/buffer.c
> ===================================================================
> --- linux-2.6.orig/fs/buffer.c
> +++ linux-2.6/fs/buffer.c
> @@ -2663,18 +2663,8 @@ int nobh_writepage(struct page *page, ge
>  	/* Is the page fully outside i_size? (truncate in progress) */
>  	offset = i_size & (PAGE_CACHE_SIZE-1);
>  	if (page->index >= end_index+1 || !offset) {
> -		/*
> -		 * The page may have dirty, unmapped buffers.  For example,
> -		 * they may have been added in ext3_writepage().  Make them
> -		 * freeable here, so the page does not leak.
> -		 */
> -#if 0
> -		/* Not really sure about this  - do we need this ? */
> -		if (page->mapping->a_ops->invalidatepage)
> -			page->mapping->a_ops->invalidatepage(page, offset);
> -#endif
>  		unlock_page(page);
> -		return 0; /* don't care */
> +		return 0;
>  	}
>  
>  	/*
> @@ -2867,14 +2857,8 @@ int block_write_full_page_endio(struct p
>  	/* Is the page fully outside i_size? (truncate in progress) */
>  	offset = i_size & (PAGE_CACHE_SIZE-1);
>  	if (page->index >= end_index+1 || !offset) {
> -		/*
> -		 * The page may have dirty, unmapped buffers.  For example,
> -		 * they may have been added in ext3_writepage().  Make them
> -		 * freeable here, so the page does not leak.
> -		 */
> -		do_invalidatepage(page, 0);
>  		unlock_page(page);
> -		return 0; /* don't care */
> +		return 0;
>  	}
>  
>  	/*
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/3] fs: buffer_head writepage no zero
  2009-07-10  9:34     ` Nick Piggin
@ 2009-07-10 11:46       ` Jan Kara
  -1 siblings, 0 replies; 28+ messages in thread
From: Jan Kara @ 2009-07-10 11:46 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, hch, viro, jack, linux-mm

On Fri 10-07-09 11:34:03, Nick Piggin wrote:
> 
> When writing a page to filesystem, buffer.c zeroes out parts of the page past
> i_size in an attempt to get zeroes into those blocks on disk, so as to honour
> the requirement that an expanding truncate should zero-fill the file.
> 
> Unfortunately, this is racy. The reason we can get something other than
> zeroes here is via an mmaped write to the block beyond i_size. Zeroing it
> out before writepage narrows the window, but it is still possible to store
> junk beyond i_size on disk, by storing into the page after writepage zeroes,
> but before DMA (or copy) completes. This allows process A to break posix
> semantics for process B (or even inadvertently for itsef).
> 
> It could also be possible that the filesystem has written data into the
> block but not yet expanded the inode size when the system crashes for
> some reason. Unless its journal reply / fsck process etc checks for this
> condition, it could also cause subsequent breakage in semantics.
  Actually, it should be possible to fix the posix semantics by zeroing out
the page when i_size is going to be extended - hmm, I see you're trying to
do something like that in ext2 code. Ugh. Since we have to lock the
old last page to make mkwrite work anyway, I think we should do it in a
generic code (probably in a separate patch and just note it here...).
  I can include it in my mkwrite fixes when I port them on top of your
patches.

> ---
>  fs/buffer.c      |   94 +++++++++++++++++++++++++------------------------------
>  fs/ext2/inode.c  |   30 ++++++++++++++++-
>  fs/mpage.c       |   13 +------
>  mm/filemap_xip.c |   15 ++++----
>  mm/truncate.c    |    1 
>  5 files changed, 82 insertions(+), 71 deletions(-)
> 
> Index: linux-2.6/fs/buffer.c
> ===================================================================
> --- linux-2.6.orig/fs/buffer.c
> +++ linux-2.6/fs/buffer.c
> @@ -2656,26 +2656,14 @@ int nobh_writepage(struct page *page, ge
>  	unsigned offset;
>  	int ret;
>  
> -	/* Is the page fully inside i_size? */
> -	if (page->index < end_index)
> -		goto out;
> -
>  	/* Is the page fully outside i_size? (truncate in progress) */
>  	offset = i_size & (PAGE_CACHE_SIZE-1);
> -	if (page->index >= end_index+1 || !offset) {
> +	if (page->index >= end_index &&
> +			(page->index >= end_index+1 || !offset)) {
>  		unlock_page(page);
>  		return 0;
>  	}
>  
> -	/*
> -	 * The page straddles i_size.  It must be zeroed out on each and every
> -	 * writepage invocation because it may be mmapped.  "A file is mapped
> -	 * in multiples of the page size.  For a file that is not a multiple of
> -	 * the  page size, the remaining memory is zeroed when mapped, and
> -	 * writes to that region are not written out to the file."
> -	 */
> -	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
> -out:
>  	ret = mpage_writepage(page, get_block, wbc);
>  	if (ret == -EAGAIN)
>  		ret = __block_write_full_page(inode, page, get_block, wbc,
> @@ -2695,22 +2683,23 @@ int nobh_truncate_page(struct address_sp
>  	struct inode *inode = mapping->host;
>  	struct page *page;
>  	struct buffer_head map_bh;
> -	int err;
> +	int err = 0;
>  
>  	blocksize = 1 << inode->i_blkbits;
>  	length = offset & (blocksize - 1);
>  
>  	/* Block boundary? Nothing to do */
>  	if (!length)
> -		return 0;
> +		goto out;
>  
>  	length = blocksize - length;
>  	iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
>  
>  	page = grab_cache_page(mapping, index);
> -	err = -ENOMEM;
> -	if (!page)
> +	if (!page) {
> +		err = -ENOMEM;
>  		goto out;
> +	}
>  
>  	if (page_has_buffers(page)) {
>  has_buffers:
> @@ -2752,7 +2741,6 @@ has_buffers:
>  	}
>  	zero_user(page, offset, length);
>  	set_page_dirty(page);
> -	err = 0;
>  
>  unlock:
>  	unlock_page(page);
  Above two chunks are just style cleanup, aren't they? Could you maybe separate
it from the logical changes?

> @@ -2762,8 +2750,8 @@ out:
>  }
>  EXPORT_SYMBOL(nobh_truncate_page);
>  
> -int block_truncate_page(struct address_space *mapping,
> -			loff_t from, get_block_t *get_block)
> +int __block_truncate_page(struct address_space *mapping,
> +			loff_t from, loff_t to, get_block_t *get_block)
>  {
>  	pgoff_t index = from >> PAGE_CACHE_SHIFT;
>  	unsigned offset = from & (PAGE_CACHE_SIZE-1);
> @@ -2773,22 +2761,22 @@ int block_truncate_page(struct address_s
>  	struct inode *inode = mapping->host;
>  	struct page *page;
>  	struct buffer_head *bh;
> -	int err;
> +	int err = 0;
>  
> -	blocksize = 1 << inode->i_blkbits;
> -	length = offset & (blocksize - 1);
> +	/* Page boundary? Nothing to do */
> +	if (!offset)
> +		goto out;
>  
> -	/* Block boundary? Nothing to do */
> -	if (!length)
> -		return 0;
> +	blocksize = 1 << inode->i_blkbits;
>  
> -	length = blocksize - length;
> +	length = (unsigned)min_t(loff_t, to - from, PAGE_CACHE_SIZE - offset);
>  	iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
>  	page = grab_cache_page(mapping, index);
> -	err = -ENOMEM;
> -	if (!page)
> +	if (!page) {
> +		err = -ENOMEM;
>  		goto out;
> +	}
>  
>  	if (!page_has_buffers(page))
>  		create_empty_buffers(page, blocksize, 0);
> @@ -2802,15 +2790,20 @@ int block_truncate_page(struct address_s
>  		pos += blocksize;
>  	}
>  
> -	err = 0;
>  	if (!buffer_mapped(bh)) {
>  		WARN_ON(bh->b_size != blocksize);
>  		err = get_block(inode, iblock, bh, 0);
>  		if (err)
>  			goto unlock;
> -		/* unmapped? It's a hole - nothing to do */
> -		if (!buffer_mapped(bh))
> +		/*
> +		 * unmapped? It's a hole - must zero out partial
> +		 * in the case of an extending truncate where mmap has
> +		 * previously written past i_size of the page
> +		 */
> +		if (!buffer_mapped(bh)) {
> +			zero_user(page, offset, length);
>  			goto unlock;
  Hmm, but who was zeroing out the page previously? Because the end of the
page gets zeroed already now...

> +		}
>  	}
>  
>  	/* Ok, it's mapped. Make sure it's up-to-date */
> @@ -2818,17 +2811,17 @@ int block_truncate_page(struct address_s
>  		set_buffer_uptodate(bh);
>  
>  	if (!buffer_uptodate(bh) && !buffer_delay(bh) && !buffer_unwritten(bh)) {
> -		err = -EIO;
>  		ll_rw_block(READ, 1, &bh);
>  		wait_on_buffer(bh);
>  		/* Uhhuh. Read error. Complain and punt. */
> -		if (!buffer_uptodate(bh))
> +		if (!buffer_uptodate(bh)) {
> +			err = -EIO;
>  			goto unlock;
> +		}
>  	}
>  
>  	zero_user(page, offset, length);
>  	mark_buffer_dirty(bh);
> -	err = 0;
>  
>  unlock:
>  	unlock_page(page);
> @@ -2836,6 +2829,19 @@ unlock:
>  out:
>  	return err;
>  }
> +EXPORT_SYMBOL(__block_truncate_page);
> +
> +int block_truncate_page(struct address_space *mapping,
> +			loff_t from, get_block_t *get_block)
> +{
> +	struct inode *inode = mapping->host;
> +	int err = 0;
> +
> +	if (from > inode->i_size)
> +		err = __block_truncate_page(mapping, inode->i_size, from, get_block);
> +
> +	return err;
> +}
>  
>  /*
>   * The generic ->writepage function for buffer-backed address_spaces
> @@ -2849,26 +2855,14 @@ int block_write_full_page_endio(struct p
>  	const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
>  	unsigned offset;
>  
> -	/* Is the page fully inside i_size? */
> -	if (page->index < end_index)
> -		return __block_write_full_page(inode, page, get_block, wbc,
> -					       handler);
> -
>  	/* Is the page fully outside i_size? (truncate in progress) */
>  	offset = i_size & (PAGE_CACHE_SIZE-1);
> -	if (page->index >= end_index+1 || !offset) {
> +	if (page->index >= end_index &&
> +			(page->index >= end_index+1 || !offset)) {
>  		unlock_page(page);
>  		return 0;
>  	}
>  
> -	/*
> -	 * The page straddles i_size.  It must be zeroed out on each and every
> -	 * writepage invokation because it may be mmapped.  "A file is mapped
> -	 * in multiples of the page size.  For a file that is not a multiple of
> -	 * the  page size, the remaining memory is zeroed when mapped, and
> -	 * writes to that region are not written out to the file."
> -	 */
> -	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
>  	return __block_write_full_page(inode, page, get_block, wbc, handler);
>  }
  I suppose you should also update __block_write_full_page() - there's
comment about zeroing. Also I'm not sure that marking buffer as uptodate
there is a good idea when the buffer isn't zeroed.

> Index: linux-2.6/mm/filemap_xip.c
> ===================================================================
> --- linux-2.6.orig/mm/filemap_xip.c
> +++ linux-2.6/mm/filemap_xip.c
> @@ -440,22 +440,23 @@ EXPORT_SYMBOL_GPL(xip_file_write);
>  int
>  xip_truncate_page(struct address_space *mapping, loff_t from)
>  {
> +	struct inode *inode = mapping->host;
>  	pgoff_t index = from >> PAGE_CACHE_SHIFT;
>  	unsigned offset = from & (PAGE_CACHE_SIZE-1);
>  	unsigned blocksize;
>  	unsigned length;
>  	void *xip_mem;
>  	unsigned long xip_pfn;
> -	int err;
> +	int err = 0;
>  
>  	BUG_ON(!mapping->a_ops->get_xip_mem);
>  
> -	blocksize = 1 << mapping->host->i_blkbits;
> +	blocksize = 1 << inode->i_blkbits;
>  	length = offset & (blocksize - 1);
>  
>  	/* Block boundary? Nothing to do */
>  	if (!length)
> -		return 0;
> +		goto out;
>  
>  	length = blocksize - length;
>  
> @@ -464,11 +465,11 @@ xip_truncate_page(struct address_space *
>  	if (unlikely(err)) {
>  		if (err == -ENODATA)
>  			/* Hole? No need to truncate */
> -			return 0;
> -		else
> -			return err;
> +			err = 0;
> +		goto out;
>  	}
>  	memset(xip_mem + offset, 0, length);
> -	return 0;
> +out:
> +	return err;
>  }
>  EXPORT_SYMBOL_GPL(xip_truncate_page);
  Again, only a style change, right?

> Index: linux-2.6/fs/mpage.c
> ===================================================================
> --- linux-2.6.orig/fs/mpage.c
> +++ linux-2.6/fs/mpage.c
> @@ -559,19 +559,10 @@ static int __mpage_writepage(struct page
>  page_is_mapped:
>  	end_index = i_size >> PAGE_CACHE_SHIFT;
>  	if (page->index >= end_index) {
> -		/*
> -		 * The page straddles i_size.  It must be zeroed out on each
> -		 * and every writepage invokation because it may be mmapped.
> -		 * "A file is mapped in multiples of the page size.  For a file
> -		 * that is not a multiple of the page size, the remaining memory
> -		 * is zeroed when mapped, and writes to that region are not
> -		 * written out to the file."
> -		 */
>  		unsigned offset = i_size & (PAGE_CACHE_SIZE - 1);
>  
> -		if (page->index > end_index || !offset)
> -			goto confused;
> -		zero_user_segment(page, offset, PAGE_CACHE_SIZE);
> +		if (page->index >= end_index+1 || !offset)
> +			goto confused; /* page fully outside i_size */
>  	}
>  
>  	/*
> Index: linux-2.6/fs/ext2/inode.c
> ===================================================================
> --- linux-2.6.orig/fs/ext2/inode.c
> +++ linux-2.6/fs/ext2/inode.c
> @@ -777,14 +777,40 @@ ext2_write_begin(struct file *file, stru
>  	return ret;
>  }
>  
> +int __block_truncate_page(struct address_space *mapping,
> +			loff_t from, loff_t to, get_block_t *get_block);
  Uf, that's ugly... Shouldn't it be in some header?

>  static int ext2_write_end(struct file *file, struct address_space *mapping,
>  			loff_t pos, unsigned len, unsigned copied,
>  			struct page *page, void *fsdata)
>  {
> +	struct inode *inode = mapping->host;
>  	int ret;
>  
> -	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
> -	if (ret < len) {
> +	ret = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> +	unlock_page(page);
> +	page_cache_release(page);
> +        if (pos+copied > inode->i_size) {
> +		int err;
> +                if (pos > inode->i_size) {
> +                        /* expanding a hole */
> +			err = __block_truncate_page(mapping, inode->i_size,
> +						pos, ext2_get_block);
> +			if (err) {
> +				ret = err;
> +				goto out;
> +			}
> +			err = __block_truncate_page(mapping, pos+copied,
> +						LLONG_MAX, ext2_get_block);
> +			if (err) {
> +				ret = err;
> +				goto out;
> +			}
> +                }
> +                i_size_write(inode, pos+copied);
> +                mark_inode_dirty(inode);
> +        }
> +out:
> +	if (ret < 0 || ret < len) {
>  		struct inode *inode = mapping->host;
>  		loff_t isize = inode->i_size;
>  		if (pos + len > isize)
  There are whitespace problems above... Also calling __block_truncate_page()
on old i_size looks strange - we just want to zero-out the page if it
exists (this way we'd unnecessarily read it from disk). Also I think
block_write_end() should do this.
  Finally, zeroing after pos+copied does not make sence to be conditioned
by pos > inode->i_size and again I don't think it's needed...

> Index: linux-2.6/mm/truncate.c
> ===================================================================
> --- linux-2.6.orig/mm/truncate.c
> +++ linux-2.6/mm/truncate.c
> @@ -49,7 +49,6 @@ void do_invalidatepage(struct page *page
>  
>  static inline void truncate_partial_page(struct page *page, unsigned partial)
>  {
> -	zero_user_segment(page, partial, PAGE_CACHE_SIZE);
>  	if (page_has_private(page))
>  		do_invalidatepage(page, partial);
>  }

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/3] fs: buffer_head writepage no zero
@ 2009-07-10 11:46       ` Jan Kara
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Kara @ 2009-07-10 11:46 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, hch, viro, jack, linux-mm

On Fri 10-07-09 11:34:03, Nick Piggin wrote:
> 
> When writing a page to filesystem, buffer.c zeroes out parts of the page past
> i_size in an attempt to get zeroes into those blocks on disk, so as to honour
> the requirement that an expanding truncate should zero-fill the file.
> 
> Unfortunately, this is racy. The reason we can get something other than
> zeroes here is via an mmaped write to the block beyond i_size. Zeroing it
> out before writepage narrows the window, but it is still possible to store
> junk beyond i_size on disk, by storing into the page after writepage zeroes,
> but before DMA (or copy) completes. This allows process A to break posix
> semantics for process B (or even inadvertently for itsef).
> 
> It could also be possible that the filesystem has written data into the
> block but not yet expanded the inode size when the system crashes for
> some reason. Unless its journal reply / fsck process etc checks for this
> condition, it could also cause subsequent breakage in semantics.
  Actually, it should be possible to fix the posix semantics by zeroing out
the page when i_size is going to be extended - hmm, I see you're trying to
do something like that in ext2 code. Ugh. Since we have to lock the
old last page to make mkwrite work anyway, I think we should do it in a
generic code (probably in a separate patch and just note it here...).
  I can include it in my mkwrite fixes when I port them on top of your
patches.

> ---
>  fs/buffer.c      |   94 +++++++++++++++++++++++++------------------------------
>  fs/ext2/inode.c  |   30 ++++++++++++++++-
>  fs/mpage.c       |   13 +------
>  mm/filemap_xip.c |   15 ++++----
>  mm/truncate.c    |    1 
>  5 files changed, 82 insertions(+), 71 deletions(-)
> 
> Index: linux-2.6/fs/buffer.c
> ===================================================================
> --- linux-2.6.orig/fs/buffer.c
> +++ linux-2.6/fs/buffer.c
> @@ -2656,26 +2656,14 @@ int nobh_writepage(struct page *page, ge
>  	unsigned offset;
>  	int ret;
>  
> -	/* Is the page fully inside i_size? */
> -	if (page->index < end_index)
> -		goto out;
> -
>  	/* Is the page fully outside i_size? (truncate in progress) */
>  	offset = i_size & (PAGE_CACHE_SIZE-1);
> -	if (page->index >= end_index+1 || !offset) {
> +	if (page->index >= end_index &&
> +			(page->index >= end_index+1 || !offset)) {
>  		unlock_page(page);
>  		return 0;
>  	}
>  
> -	/*
> -	 * The page straddles i_size.  It must be zeroed out on each and every
> -	 * writepage invocation because it may be mmapped.  "A file is mapped
> -	 * in multiples of the page size.  For a file that is not a multiple of
> -	 * the  page size, the remaining memory is zeroed when mapped, and
> -	 * writes to that region are not written out to the file."
> -	 */
> -	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
> -out:
>  	ret = mpage_writepage(page, get_block, wbc);
>  	if (ret == -EAGAIN)
>  		ret = __block_write_full_page(inode, page, get_block, wbc,
> @@ -2695,22 +2683,23 @@ int nobh_truncate_page(struct address_sp
>  	struct inode *inode = mapping->host;
>  	struct page *page;
>  	struct buffer_head map_bh;
> -	int err;
> +	int err = 0;
>  
>  	blocksize = 1 << inode->i_blkbits;
>  	length = offset & (blocksize - 1);
>  
>  	/* Block boundary? Nothing to do */
>  	if (!length)
> -		return 0;
> +		goto out;
>  
>  	length = blocksize - length;
>  	iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
>  
>  	page = grab_cache_page(mapping, index);
> -	err = -ENOMEM;
> -	if (!page)
> +	if (!page) {
> +		err = -ENOMEM;
>  		goto out;
> +	}
>  
>  	if (page_has_buffers(page)) {
>  has_buffers:
> @@ -2752,7 +2741,6 @@ has_buffers:
>  	}
>  	zero_user(page, offset, length);
>  	set_page_dirty(page);
> -	err = 0;
>  
>  unlock:
>  	unlock_page(page);
  Above two chunks are just style cleanup, aren't they? Could you maybe separate
it from the logical changes?

> @@ -2762,8 +2750,8 @@ out:
>  }
>  EXPORT_SYMBOL(nobh_truncate_page);
>  
> -int block_truncate_page(struct address_space *mapping,
> -			loff_t from, get_block_t *get_block)
> +int __block_truncate_page(struct address_space *mapping,
> +			loff_t from, loff_t to, get_block_t *get_block)
>  {
>  	pgoff_t index = from >> PAGE_CACHE_SHIFT;
>  	unsigned offset = from & (PAGE_CACHE_SIZE-1);
> @@ -2773,22 +2761,22 @@ int block_truncate_page(struct address_s
>  	struct inode *inode = mapping->host;
>  	struct page *page;
>  	struct buffer_head *bh;
> -	int err;
> +	int err = 0;
>  
> -	blocksize = 1 << inode->i_blkbits;
> -	length = offset & (blocksize - 1);
> +	/* Page boundary? Nothing to do */
> +	if (!offset)
> +		goto out;
>  
> -	/* Block boundary? Nothing to do */
> -	if (!length)
> -		return 0;
> +	blocksize = 1 << inode->i_blkbits;
>  
> -	length = blocksize - length;
> +	length = (unsigned)min_t(loff_t, to - from, PAGE_CACHE_SIZE - offset);
>  	iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
>  	page = grab_cache_page(mapping, index);
> -	err = -ENOMEM;
> -	if (!page)
> +	if (!page) {
> +		err = -ENOMEM;
>  		goto out;
> +	}
>  
>  	if (!page_has_buffers(page))
>  		create_empty_buffers(page, blocksize, 0);
> @@ -2802,15 +2790,20 @@ int block_truncate_page(struct address_s
>  		pos += blocksize;
>  	}
>  
> -	err = 0;
>  	if (!buffer_mapped(bh)) {
>  		WARN_ON(bh->b_size != blocksize);
>  		err = get_block(inode, iblock, bh, 0);
>  		if (err)
>  			goto unlock;
> -		/* unmapped? It's a hole - nothing to do */
> -		if (!buffer_mapped(bh))
> +		/*
> +		 * unmapped? It's a hole - must zero out partial
> +		 * in the case of an extending truncate where mmap has
> +		 * previously written past i_size of the page
> +		 */
> +		if (!buffer_mapped(bh)) {
> +			zero_user(page, offset, length);
>  			goto unlock;
  Hmm, but who was zeroing out the page previously? Because the end of the
page gets zeroed already now...

> +		}
>  	}
>  
>  	/* Ok, it's mapped. Make sure it's up-to-date */
> @@ -2818,17 +2811,17 @@ int block_truncate_page(struct address_s
>  		set_buffer_uptodate(bh);
>  
>  	if (!buffer_uptodate(bh) && !buffer_delay(bh) && !buffer_unwritten(bh)) {
> -		err = -EIO;
>  		ll_rw_block(READ, 1, &bh);
>  		wait_on_buffer(bh);
>  		/* Uhhuh. Read error. Complain and punt. */
> -		if (!buffer_uptodate(bh))
> +		if (!buffer_uptodate(bh)) {
> +			err = -EIO;
>  			goto unlock;
> +		}
>  	}
>  
>  	zero_user(page, offset, length);
>  	mark_buffer_dirty(bh);
> -	err = 0;
>  
>  unlock:
>  	unlock_page(page);
> @@ -2836,6 +2829,19 @@ unlock:
>  out:
>  	return err;
>  }
> +EXPORT_SYMBOL(__block_truncate_page);
> +
> +int block_truncate_page(struct address_space *mapping,
> +			loff_t from, get_block_t *get_block)
> +{
> +	struct inode *inode = mapping->host;
> +	int err = 0;
> +
> +	if (from > inode->i_size)
> +		err = __block_truncate_page(mapping, inode->i_size, from, get_block);
> +
> +	return err;
> +}
>  
>  /*
>   * The generic ->writepage function for buffer-backed address_spaces
> @@ -2849,26 +2855,14 @@ int block_write_full_page_endio(struct p
>  	const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
>  	unsigned offset;
>  
> -	/* Is the page fully inside i_size? */
> -	if (page->index < end_index)
> -		return __block_write_full_page(inode, page, get_block, wbc,
> -					       handler);
> -
>  	/* Is the page fully outside i_size? (truncate in progress) */
>  	offset = i_size & (PAGE_CACHE_SIZE-1);
> -	if (page->index >= end_index+1 || !offset) {
> +	if (page->index >= end_index &&
> +			(page->index >= end_index+1 || !offset)) {
>  		unlock_page(page);
>  		return 0;
>  	}
>  
> -	/*
> -	 * The page straddles i_size.  It must be zeroed out on each and every
> -	 * writepage invokation because it may be mmapped.  "A file is mapped
> -	 * in multiples of the page size.  For a file that is not a multiple of
> -	 * the  page size, the remaining memory is zeroed when mapped, and
> -	 * writes to that region are not written out to the file."
> -	 */
> -	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
>  	return __block_write_full_page(inode, page, get_block, wbc, handler);
>  }
  I suppose you should also update __block_write_full_page() - there's
comment about zeroing. Also I'm not sure that marking buffer as uptodate
there is a good idea when the buffer isn't zeroed.

> Index: linux-2.6/mm/filemap_xip.c
> ===================================================================
> --- linux-2.6.orig/mm/filemap_xip.c
> +++ linux-2.6/mm/filemap_xip.c
> @@ -440,22 +440,23 @@ EXPORT_SYMBOL_GPL(xip_file_write);
>  int
>  xip_truncate_page(struct address_space *mapping, loff_t from)
>  {
> +	struct inode *inode = mapping->host;
>  	pgoff_t index = from >> PAGE_CACHE_SHIFT;
>  	unsigned offset = from & (PAGE_CACHE_SIZE-1);
>  	unsigned blocksize;
>  	unsigned length;
>  	void *xip_mem;
>  	unsigned long xip_pfn;
> -	int err;
> +	int err = 0;
>  
>  	BUG_ON(!mapping->a_ops->get_xip_mem);
>  
> -	blocksize = 1 << mapping->host->i_blkbits;
> +	blocksize = 1 << inode->i_blkbits;
>  	length = offset & (blocksize - 1);
>  
>  	/* Block boundary? Nothing to do */
>  	if (!length)
> -		return 0;
> +		goto out;
>  
>  	length = blocksize - length;
>  
> @@ -464,11 +465,11 @@ xip_truncate_page(struct address_space *
>  	if (unlikely(err)) {
>  		if (err == -ENODATA)
>  			/* Hole? No need to truncate */
> -			return 0;
> -		else
> -			return err;
> +			err = 0;
> +		goto out;
>  	}
>  	memset(xip_mem + offset, 0, length);
> -	return 0;
> +out:
> +	return err;
>  }
>  EXPORT_SYMBOL_GPL(xip_truncate_page);
  Again, only a style change, right?

> Index: linux-2.6/fs/mpage.c
> ===================================================================
> --- linux-2.6.orig/fs/mpage.c
> +++ linux-2.6/fs/mpage.c
> @@ -559,19 +559,10 @@ static int __mpage_writepage(struct page
>  page_is_mapped:
>  	end_index = i_size >> PAGE_CACHE_SHIFT;
>  	if (page->index >= end_index) {
> -		/*
> -		 * The page straddles i_size.  It must be zeroed out on each
> -		 * and every writepage invokation because it may be mmapped.
> -		 * "A file is mapped in multiples of the page size.  For a file
> -		 * that is not a multiple of the page size, the remaining memory
> -		 * is zeroed when mapped, and writes to that region are not
> -		 * written out to the file."
> -		 */
>  		unsigned offset = i_size & (PAGE_CACHE_SIZE - 1);
>  
> -		if (page->index > end_index || !offset)
> -			goto confused;
> -		zero_user_segment(page, offset, PAGE_CACHE_SIZE);
> +		if (page->index >= end_index+1 || !offset)
> +			goto confused; /* page fully outside i_size */
>  	}
>  
>  	/*
> Index: linux-2.6/fs/ext2/inode.c
> ===================================================================
> --- linux-2.6.orig/fs/ext2/inode.c
> +++ linux-2.6/fs/ext2/inode.c
> @@ -777,14 +777,40 @@ ext2_write_begin(struct file *file, stru
>  	return ret;
>  }
>  
> +int __block_truncate_page(struct address_space *mapping,
> +			loff_t from, loff_t to, get_block_t *get_block);
  Uf, that's ugly... Shouldn't it be in some header?

>  static int ext2_write_end(struct file *file, struct address_space *mapping,
>  			loff_t pos, unsigned len, unsigned copied,
>  			struct page *page, void *fsdata)
>  {
> +	struct inode *inode = mapping->host;
>  	int ret;
>  
> -	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
> -	if (ret < len) {
> +	ret = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> +	unlock_page(page);
> +	page_cache_release(page);
> +        if (pos+copied > inode->i_size) {
> +		int err;
> +                if (pos > inode->i_size) {
> +                        /* expanding a hole */
> +			err = __block_truncate_page(mapping, inode->i_size,
> +						pos, ext2_get_block);
> +			if (err) {
> +				ret = err;
> +				goto out;
> +			}
> +			err = __block_truncate_page(mapping, pos+copied,
> +						LLONG_MAX, ext2_get_block);
> +			if (err) {
> +				ret = err;
> +				goto out;
> +			}
> +                }
> +                i_size_write(inode, pos+copied);
> +                mark_inode_dirty(inode);
> +        }
> +out:
> +	if (ret < 0 || ret < len) {
>  		struct inode *inode = mapping->host;
>  		loff_t isize = inode->i_size;
>  		if (pos + len > isize)
  There are whitespace problems above... Also calling __block_truncate_page()
on old i_size looks strange - we just want to zero-out the page if it
exists (this way we'd unnecessarily read it from disk). Also I think
block_write_end() should do this.
  Finally, zeroing after pos+copied does not make sence to be conditioned
by pos > inode->i_size and again I don't think it's needed...

> Index: linux-2.6/mm/truncate.c
> ===================================================================
> --- linux-2.6.orig/mm/truncate.c
> +++ linux-2.6/mm/truncate.c
> @@ -49,7 +49,6 @@ void do_invalidatepage(struct page *page
>  
>  static inline void truncate_partial_page(struct page *page, unsigned partial)
>  {
> -	zero_user_segment(page, partial, PAGE_CACHE_SIZE);
>  	if (page_has_private(page))
>  		do_invalidatepage(page, partial);
>  }

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 0/5] new truncate sequence patchset
  2009-07-10  7:30 [patch 0/5] new truncate sequence patchset npiggin
                   ` (5 preceding siblings ...)
  2009-07-10  9:33   ` Nick Piggin
@ 2009-07-10 14:31 ` Christoph Hellwig
  6 siblings, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2009-07-10 14:31 UTC (permalink / raw)
  To: npiggin; +Cc: linux-fsdevel, hch, viro, jack

Here is an attempt at the XFS conversion.  It does call the full
setattr on the write failures as that's what has been tested with
a previous attempt at fixing those isuses.  I'll see if I can factor
our just the actual block truncation.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/3] fs: buffer_head writepage no zero
  2009-07-10 11:46       ` Jan Kara
@ 2009-07-13  6:54         ` Nick Piggin
  -1 siblings, 0 replies; 28+ messages in thread
From: Nick Piggin @ 2009-07-13  6:54 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, hch, viro, linux-mm

On Fri, Jul 10, 2009 at 01:46:51PM +0200, Jan Kara wrote:
> On Fri 10-07-09 11:34:03, Nick Piggin wrote:
> > 
> > When writing a page to filesystem, buffer.c zeroes out parts of the page past
> > i_size in an attempt to get zeroes into those blocks on disk, so as to honour
> > the requirement that an expanding truncate should zero-fill the file.
> > 
> > Unfortunately, this is racy. The reason we can get something other than
> > zeroes here is via an mmaped write to the block beyond i_size. Zeroing it
> > out before writepage narrows the window, but it is still possible to store
> > junk beyond i_size on disk, by storing into the page after writepage zeroes,
> > but before DMA (or copy) completes. This allows process A to break posix
> > semantics for process B (or even inadvertently for itsef).
> > 
> > It could also be possible that the filesystem has written data into the
> > block but not yet expanded the inode size when the system crashes for
> > some reason. Unless its journal reply / fsck process etc checks for this
> > condition, it could also cause subsequent breakage in semantics.
>   Actually, it should be possible to fix the posix semantics by zeroing out
> the page when i_size is going to be extended - hmm, I see you're trying to
> do something like that in ext2 code. Ugh. Since we have to lock the

Yeah, it could probably do it in write_begin in generic code, that
part was a bit ugly.

> old last page to make mkwrite work anyway, I think we should do it in a
> generic code (probably in a separate patch and just note it here...).
>   I can include it in my mkwrite fixes when I port them on top of your
> patches.
> 
> > @@ -2752,7 +2741,6 @@ has_buffers:
> >  	}
> >  	zero_user(page, offset, length);
> >  	set_page_dirty(page);
> > -	err = 0;
> >  
> >  unlock:
> >  	unlock_page(page);
>   Above two chunks are just style cleanup, aren't they? Could you maybe separate
> it from the logical changes?

Yes I think so. They devolved from something that was actually useful,
and I should remove them.

 
> > @@ -2802,15 +2790,20 @@ int block_truncate_page(struct address_s
> >  		pos += blocksize;
> >  	}
> >  
> > -	err = 0;
> >  	if (!buffer_mapped(bh)) {
> >  		WARN_ON(bh->b_size != blocksize);
> >  		err = get_block(inode, iblock, bh, 0);
> >  		if (err)
> >  			goto unlock;
> > -		/* unmapped? It's a hole - nothing to do */
> > -		if (!buffer_mapped(bh))
> > +		/*
> > +		 * unmapped? It's a hole - must zero out partial
> > +		 * in the case of an extending truncate where mmap has
> > +		 * previously written past i_size of the page
> > +		 */
> > +		if (!buffer_mapped(bh)) {
> > +			zero_user(page, offset, length);
> >  			goto unlock;
>   Hmm, but who was zeroing out the page previously? Because the end of the
> page gets zeroed already now...

Yes it does aready get zeroed, however I think ftruncate semantics
say that expanding ftruncate shoud leave the new area with zero
filled. A partial-mmap on the last page could have dirtied these
parts of the page and so break the guarantee.

I guess it could be ignored because such partial mmap writes are
supposed to result in undefined behaviour, however I think it is
a bit wrong (also the result could change based on memory pressure
even when another program opens the file, so I think zeroing here
is best).
 

> > -	/*
> > -	 * The page straddles i_size.  It must be zeroed out on each and every
> > -	 * writepage invokation because it may be mmapped.  "A file is mapped
> > -	 * in multiples of the page size.  For a file that is not a multiple of
> > -	 * the  page size, the remaining memory is zeroed when mapped, and
> > -	 * writes to that region are not written out to the file."
> > -	 */
> > -	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
> >  	return __block_write_full_page(inode, page, get_block, wbc, handler);
> >  }
>   I suppose you should also update __block_write_full_page() - there's
> comment about zeroing. Also I'm not sure that marking buffer as uptodate
> there is a good idea when the buffer isn't zeroed.

Thanks I'll check it out.


> >  EXPORT_SYMBOL_GPL(xip_truncate_page);
>   Again, only a style change, right?

Yes.

> > Index: linux-2.6/fs/ext2/inode.c
> > ===================================================================
> > --- linux-2.6.orig/fs/ext2/inode.c
> > +++ linux-2.6/fs/ext2/inode.c
> > @@ -777,14 +777,40 @@ ext2_write_begin(struct file *file, stru
> >  	return ret;
> >  }
> >  
> > +int __block_truncate_page(struct address_space *mapping,
> > +			loff_t from, loff_t to, get_block_t *get_block);
>   Uf, that's ugly... Shouldn't it be in some header?
> 
> >  static int ext2_write_end(struct file *file, struct address_space *mapping,
> >  			loff_t pos, unsigned len, unsigned copied,
> >  			struct page *page, void *fsdata)
> >  {
> > +	struct inode *inode = mapping->host;
> >  	int ret;
> >  
> > -	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
> > -	if (ret < len) {
> > +	ret = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> > +	unlock_page(page);
> > +	page_cache_release(page);
> > +        if (pos+copied > inode->i_size) {
> > +		int err;
> > +                if (pos > inode->i_size) {
> > +                        /* expanding a hole */
> > +			err = __block_truncate_page(mapping, inode->i_size,
> > +						pos, ext2_get_block);
> > +			if (err) {
> > +				ret = err;
> > +				goto out;
> > +			}
> > +			err = __block_truncate_page(mapping, pos+copied,
> > +						LLONG_MAX, ext2_get_block);
> > +			if (err) {
> > +				ret = err;
> > +				goto out;
> > +			}
> > +                }
> > +                i_size_write(inode, pos+copied);
> > +                mark_inode_dirty(inode);
> > +        }
> > +out:
> > +	if (ret < 0 || ret < len) {
> >  		struct inode *inode = mapping->host;
> >  		loff_t isize = inode->i_size;
> >  		if (pos + len > isize)
>   There are whitespace problems above... Also calling __block_truncate_page()
> on old i_size looks strange - we just want to zero-out the page if it
> exists (this way we'd unnecessarily read it from disk). Also I think
> block_write_end() should do this.
>   Finally, zeroing after pos+copied does not make sence to be conditioned
> by pos > inode->i_size and again I don't think it's needed...

Yeah this part was ugly because it was just a result of working
through bugs and I didn't really try to make it nice. I agree if
we can move as much as possible to generic code it woud be
best.

Thanks for review. I'll try to post another version soon.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/3] fs: buffer_head writepage no zero
@ 2009-07-13  6:54         ` Nick Piggin
  0 siblings, 0 replies; 28+ messages in thread
From: Nick Piggin @ 2009-07-13  6:54 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, hch, viro, linux-mm

On Fri, Jul 10, 2009 at 01:46:51PM +0200, Jan Kara wrote:
> On Fri 10-07-09 11:34:03, Nick Piggin wrote:
> > 
> > When writing a page to filesystem, buffer.c zeroes out parts of the page past
> > i_size in an attempt to get zeroes into those blocks on disk, so as to honour
> > the requirement that an expanding truncate should zero-fill the file.
> > 
> > Unfortunately, this is racy. The reason we can get something other than
> > zeroes here is via an mmaped write to the block beyond i_size. Zeroing it
> > out before writepage narrows the window, but it is still possible to store
> > junk beyond i_size on disk, by storing into the page after writepage zeroes,
> > but before DMA (or copy) completes. This allows process A to break posix
> > semantics for process B (or even inadvertently for itsef).
> > 
> > It could also be possible that the filesystem has written data into the
> > block but not yet expanded the inode size when the system crashes for
> > some reason. Unless its journal reply / fsck process etc checks for this
> > condition, it could also cause subsequent breakage in semantics.
>   Actually, it should be possible to fix the posix semantics by zeroing out
> the page when i_size is going to be extended - hmm, I see you're trying to
> do something like that in ext2 code. Ugh. Since we have to lock the

Yeah, it could probably do it in write_begin in generic code, that
part was a bit ugly.

> old last page to make mkwrite work anyway, I think we should do it in a
> generic code (probably in a separate patch and just note it here...).
>   I can include it in my mkwrite fixes when I port them on top of your
> patches.
> 
> > @@ -2752,7 +2741,6 @@ has_buffers:
> >  	}
> >  	zero_user(page, offset, length);
> >  	set_page_dirty(page);
> > -	err = 0;
> >  
> >  unlock:
> >  	unlock_page(page);
>   Above two chunks are just style cleanup, aren't they? Could you maybe separate
> it from the logical changes?

Yes I think so. They devolved from something that was actually useful,
and I should remove them.

 
> > @@ -2802,15 +2790,20 @@ int block_truncate_page(struct address_s
> >  		pos += blocksize;
> >  	}
> >  
> > -	err = 0;
> >  	if (!buffer_mapped(bh)) {
> >  		WARN_ON(bh->b_size != blocksize);
> >  		err = get_block(inode, iblock, bh, 0);
> >  		if (err)
> >  			goto unlock;
> > -		/* unmapped? It's a hole - nothing to do */
> > -		if (!buffer_mapped(bh))
> > +		/*
> > +		 * unmapped? It's a hole - must zero out partial
> > +		 * in the case of an extending truncate where mmap has
> > +		 * previously written past i_size of the page
> > +		 */
> > +		if (!buffer_mapped(bh)) {
> > +			zero_user(page, offset, length);
> >  			goto unlock;
>   Hmm, but who was zeroing out the page previously? Because the end of the
> page gets zeroed already now...

Yes it does aready get zeroed, however I think ftruncate semantics
say that expanding ftruncate shoud leave the new area with zero
filled. A partial-mmap on the last page could have dirtied these
parts of the page and so break the guarantee.

I guess it could be ignored because such partial mmap writes are
supposed to result in undefined behaviour, however I think it is
a bit wrong (also the result could change based on memory pressure
even when another program opens the file, so I think zeroing here
is best).
 

> > -	/*
> > -	 * The page straddles i_size.  It must be zeroed out on each and every
> > -	 * writepage invokation because it may be mmapped.  "A file is mapped
> > -	 * in multiples of the page size.  For a file that is not a multiple of
> > -	 * the  page size, the remaining memory is zeroed when mapped, and
> > -	 * writes to that region are not written out to the file."
> > -	 */
> > -	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
> >  	return __block_write_full_page(inode, page, get_block, wbc, handler);
> >  }
>   I suppose you should also update __block_write_full_page() - there's
> comment about zeroing. Also I'm not sure that marking buffer as uptodate
> there is a good idea when the buffer isn't zeroed.

Thanks I'll check it out.


> >  EXPORT_SYMBOL_GPL(xip_truncate_page);
>   Again, only a style change, right?

Yes.

> > Index: linux-2.6/fs/ext2/inode.c
> > ===================================================================
> > --- linux-2.6.orig/fs/ext2/inode.c
> > +++ linux-2.6/fs/ext2/inode.c
> > @@ -777,14 +777,40 @@ ext2_write_begin(struct file *file, stru
> >  	return ret;
> >  }
> >  
> > +int __block_truncate_page(struct address_space *mapping,
> > +			loff_t from, loff_t to, get_block_t *get_block);
>   Uf, that's ugly... Shouldn't it be in some header?
> 
> >  static int ext2_write_end(struct file *file, struct address_space *mapping,
> >  			loff_t pos, unsigned len, unsigned copied,
> >  			struct page *page, void *fsdata)
> >  {
> > +	struct inode *inode = mapping->host;
> >  	int ret;
> >  
> > -	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
> > -	if (ret < len) {
> > +	ret = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> > +	unlock_page(page);
> > +	page_cache_release(page);
> > +        if (pos+copied > inode->i_size) {
> > +		int err;
> > +                if (pos > inode->i_size) {
> > +                        /* expanding a hole */
> > +			err = __block_truncate_page(mapping, inode->i_size,
> > +						pos, ext2_get_block);
> > +			if (err) {
> > +				ret = err;
> > +				goto out;
> > +			}
> > +			err = __block_truncate_page(mapping, pos+copied,
> > +						LLONG_MAX, ext2_get_block);
> > +			if (err) {
> > +				ret = err;
> > +				goto out;
> > +			}
> > +                }
> > +                i_size_write(inode, pos+copied);
> > +                mark_inode_dirty(inode);
> > +        }
> > +out:
> > +	if (ret < 0 || ret < len) {
> >  		struct inode *inode = mapping->host;
> >  		loff_t isize = inode->i_size;
> >  		if (pos + len > isize)
>   There are whitespace problems above... Also calling __block_truncate_page()
> on old i_size looks strange - we just want to zero-out the page if it
> exists (this way we'd unnecessarily read it from disk). Also I think
> block_write_end() should do this.
>   Finally, zeroing after pos+copied does not make sence to be conditioned
> by pos > inode->i_size and again I don't think it's needed...

Yeah this part was ugly because it was just a result of working
through bugs and I didn't really try to make it nice. I agree if
we can move as much as possible to generic code it woud be
best.

Thanks for review. I'll try to post another version soon.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 5/5] ext2: convert to use the new truncate convention.
  2009-07-10  7:30 ` [patch 5/5] ext2: " npiggin
@ 2009-09-01 18:29   ` Jan Kara
  2009-09-02  9:14     ` Nick Piggin
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Kara @ 2009-09-01 18:29 UTC (permalink / raw)
  To: npiggin; +Cc: linux-fsdevel, hch, viro, jack, linux-ext4, Andrew Morton

  Hi Nick,

On Fri 10-07-09 17:30:33, npiggin@suse.de wrote:
> I have some questions, marked with XXX.
> 
> Cc: linux-ext4@vger.kernel.org
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
>  fs/ext2/ext2.h  |    1 
>  fs/ext2/file.c  |    2 
>  fs/ext2/inode.c |  138 +++++++++++++++++++++++++++++++++++++++++++++-----------
>  3 files changed, 113 insertions(+), 28 deletions(-)
> 
> Index: linux-2.6/fs/ext2/inode.c
> ===================================================================
> --- linux-2.6.orig/fs/ext2/inode.c
> +++ linux-2.6/fs/ext2/inode.c
...
> +static void ext2_truncate_blocks(struct inode *inode, loff_t offset)
> +{
> +	/*
> +	 * XXX: it seems like a bug here that we don't allow
> +	 * IS_APPEND inode to have blocks-past-i_size trimmed off.
> +	 * review and fix this.
> +	 */
> +	if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
> +	    S_ISLNK(inode->i_mode)))
> +		return -EINVAL;
> +	if (ext2_inode_is_fast_symlink(inode))
> +		return -EINVAL;
> +	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
> +		return -EPERM;
> +	__ext2_truncate_blocks(inode, offset);
> +}
  Actually, looking again into this, I think you've introduced a subtle bug
into the code. When a write fails for some reason, we did vmtruncate()
previously which called block_truncate_page() which zeroed a tail of the
last block. Now, ext2_truncate_blocks() does not do this and I think it
could be a problem because at least in direct IO case, we could have written
some data into that block on disk.
  We really rely on the tail of the block being zeroed because otherwise
extending truncate will show those old data in the block. I plan to change
that in my mkwrite fixes but until then, we should preserve the old
assumptions.
  So I think that ext2_truncate_blocks() should do all that tail page magic
as well (although it's not needed in all cases).

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 5/5] ext2: convert to use the new truncate convention.
  2009-09-01 18:29   ` Jan Kara
@ 2009-09-02  9:14     ` Nick Piggin
  2009-09-02 11:14       ` Jan Kara
  0 siblings, 1 reply; 28+ messages in thread
From: Nick Piggin @ 2009-09-02  9:14 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, hch, viro, linux-ext4, Andrew Morton

On Tue, Sep 01, 2009 at 08:29:29PM +0200, Jan Kara wrote:
>   Hi Nick,
> 
> On Fri 10-07-09 17:30:33, npiggin@suse.de wrote:
> > I have some questions, marked with XXX.
> > 
> > Cc: linux-ext4@vger.kernel.org
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > ---
> >  fs/ext2/ext2.h  |    1 
> >  fs/ext2/file.c  |    2 
> >  fs/ext2/inode.c |  138 +++++++++++++++++++++++++++++++++++++++++++++-----------
> >  3 files changed, 113 insertions(+), 28 deletions(-)
> > 
> > Index: linux-2.6/fs/ext2/inode.c
> > ===================================================================
> > --- linux-2.6.orig/fs/ext2/inode.c
> > +++ linux-2.6/fs/ext2/inode.c
> ...
> > +static void ext2_truncate_blocks(struct inode *inode, loff_t offset)
> > +{
> > +	/*
> > +	 * XXX: it seems like a bug here that we don't allow
> > +	 * IS_APPEND inode to have blocks-past-i_size trimmed off.
> > +	 * review and fix this.
> > +	 */
> > +	if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
> > +	    S_ISLNK(inode->i_mode)))
> > +		return -EINVAL;
> > +	if (ext2_inode_is_fast_symlink(inode))
> > +		return -EINVAL;
> > +	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
> > +		return -EPERM;
> > +	__ext2_truncate_blocks(inode, offset);
> > +}
>   Actually, looking again into this, I think you've introduced a subtle bug
> into the code. When a write fails for some reason, we did vmtruncate()
> previously which called block_truncate_page() which zeroed a tail of the
> last block. Now, ext2_truncate_blocks() does not do this and I think it
> could be a problem because at least in direct IO case, we could have written
> some data into that block on disk.
>   We really rely on the tail of the block being zeroed because otherwise
> extending truncate will show those old data in the block. I plan to change
> that in my mkwrite fixes but until then, we should preserve the old
> assumptions.
>   So I think that ext2_truncate_blocks() should do all that tail page magic
> as well (although it's not needed in all cases).

Hi Jan,

Yeah I did think about this and yes we usually do need to zero out
the page but for these error cases with normal writes we shouldn't
write anything in there.  For direct IO... I don't see the problem
because we're not coherent with pagecache anyway...

Hmm, but possiby it is a good idea just to keep the same block_truncate_page
calls for this patchset and we can look at it again with your truncate
patches. I'll work on fixing these up.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 5/5] ext2: convert to use the new truncate convention.
  2009-09-02  9:14     ` Nick Piggin
@ 2009-09-02 11:14       ` Jan Kara
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Kara @ 2009-09-02 11:14 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, hch, viro, linux-ext4, Andrew Morton

On Wed 02-09-09 11:14:26, Nick Piggin wrote:
> On Tue, Sep 01, 2009 at 08:29:29PM +0200, Jan Kara wrote:
> >   Hi Nick,
> > 
> > On Fri 10-07-09 17:30:33, npiggin@suse.de wrote:
> > > I have some questions, marked with XXX.
> > > 
> > > Cc: linux-ext4@vger.kernel.org
> > > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > > ---
> > >  fs/ext2/ext2.h  |    1 
> > >  fs/ext2/file.c  |    2 
> > >  fs/ext2/inode.c |  138 +++++++++++++++++++++++++++++++++++++++++++++-----------
> > >  3 files changed, 113 insertions(+), 28 deletions(-)
> > > 
> > > Index: linux-2.6/fs/ext2/inode.c
> > > ===================================================================
> > > --- linux-2.6.orig/fs/ext2/inode.c
> > > +++ linux-2.6/fs/ext2/inode.c
> > ...
> > > +static void ext2_truncate_blocks(struct inode *inode, loff_t offset)
> > > +{
> > > +	/*
> > > +	 * XXX: it seems like a bug here that we don't allow
> > > +	 * IS_APPEND inode to have blocks-past-i_size trimmed off.
> > > +	 * review and fix this.
> > > +	 */
> > > +	if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
> > > +	    S_ISLNK(inode->i_mode)))
> > > +		return -EINVAL;
> > > +	if (ext2_inode_is_fast_symlink(inode))
> > > +		return -EINVAL;
> > > +	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
> > > +		return -EPERM;
> > > +	__ext2_truncate_blocks(inode, offset);
> > > +}
> >   Actually, looking again into this, I think you've introduced a subtle bug
> > into the code. When a write fails for some reason, we did vmtruncate()
> > previously which called block_truncate_page() which zeroed a tail of the
> > last block. Now, ext2_truncate_blocks() does not do this and I think it
> > could be a problem because at least in direct IO case, we could have written
> > some data into that block on disk.
> >   We really rely on the tail of the block being zeroed because otherwise
> > extending truncate will show those old data in the block. I plan to change
> > that in my mkwrite fixes but until then, we should preserve the old
> > assumptions.
> >   So I think that ext2_truncate_blocks() should do all that tail page magic
> > as well (although it's not needed in all cases).
> 
> Hi Jan,
> 
> Yeah I did think about this and yes we usually do need to zero out
> the page but for these error cases with normal writes we shouldn't
> write anything in there.  For direct IO... I don't see the problem
> because we're not coherent with pagecache anyway...
  We are not coherent but that's irrelevant - if a failed direct write
is followed by an extending truncate and read, it will read the block
from disk and could see non-zeros where there should be zeros...

> Hmm, but possiby it is a good idea just to keep the same block_truncate_page
> calls for this patchset and we can look at it again with your truncate
> patches. I'll work on fixing these up.
  Yes, I think that keeping this change for a separate patch is definitely
better.

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 5/5] ext2: convert to use the new truncate convention.
  2009-08-17 11:09       ` Boaz Harrosh
@ 2009-08-17 16:44         ` Nick Piggin
  0 siblings, 0 replies; 28+ messages in thread
From: Nick Piggin @ 2009-08-17 16:44 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christoph Hellwig, Andrew Morton, linux-fsdevel, linux-ext4,
	Christoph Hellwig

On Mon, Aug 17, 2009 at 02:09:54PM +0300, Boaz Harrosh wrote:
> On 08/17/2009 09:42 AM, Nick Piggin wrote:
> > On Sun, Aug 16, 2009 at 04:16:26PM -0400, Christoph Hellwig wrote:
> >> On Sun, Aug 16, 2009 at 08:25:38PM +1000, npiggin@suse.de wrote:
> >>> @@ -66,9 +68,10 @@ void ext2_delete_inode (struct inode * i
> >>>  	mark_inode_dirty(inode);
> >>>  	ext2_write_inode(inode, inode_needs_sync(inode));
> >>>  
> >>> +	/* XXX: where is truncate_inode_pages? */
> >>>  	inode->i_size = 0;
> >>>  	if (inode->i_blocks)
> >>> -		ext2_truncate (inode);
> >>> +		ext2_truncate_blocks(inode, 0);
> >>>  	ext2_free_inode (inode);
> >>
> >> At the beginning of the function, just before the diff window.  Because
> >> this is ->delete_inode we truncate away all pages, down to offset 0.
> > 
> > OK, weird. I thought I couldn't see it when I wrote that :) maybe my
> > tree was corrupted or I'm stupid.
> > 
> > 
> >>> +static void ext2_truncate_blocks(struct inode *inode, loff_t offset)
> >>> +{
> >>> +	/*
> >>> +	 * XXX: it seems like a bug here that we don't allow
> >>> +	 * IS_APPEND inode to have blocks-past-i_size trimmed off.
> >>> +	 * review and fix this.
> >>> +	 */
> >>> +	if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
> >>> +	    S_ISLNK(inode->i_mode)))
> >>> +		return -EINVAL;
> >>> +	if (ext2_inode_is_fast_symlink(inode))
> >>> +		return -EINVAL;
> >>> +	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
> >>> +		return -EPERM;
> >>> +	__ext2_truncate_blocks(inode, offset);
> >>
> >> Yes, I think the IS_APPEND(inode) || IS_IMMUTABLE(inode) checks should move
> >> into ext2_setsize.  But let's leave that for a separate patch.
> > 
> > Yeah agreed.
> > 
> >  
> >> Btw, the above code gives me warnings like this:
> >>
> >> /home/hch/work/linux-2.6/fs/ext2/inode.c: In function
> >> 'ext2_truncate_blocks':
> >> /home/hch/work/linux-2.6/fs/ext2/inode.c:1158: warning: 'return' with a
> >> value, in function returning void
> >> /home/hch/work/linux-2.6/fs/ext2/inode.c:1160: warning: 'return' with a
> >> value, in function returning void
> >> /home/hch/work/linux-2.6/fs/ext2/inode.c:1162: warning: 'return' with a
> >> value, in function returning void
> >>
> >> because you try to return errors from a function delcared as void.
> > 
> > Hm, sorry. I thought it was in good shape... I'll recheck that I sent
> > out the correct versions and resend according to feedback from you
> > and Hugh.
> > 
> > Thanks,
> > Nick
> > 
> 
> Nick do you have a public tree with these latest patches? I would like to
> try out an exofs conversion and testing.

Hi Boaz,

I don't have a public tree. I can send you a rollup if you ping me or
just take the patches from the list. I will send out a new set shortly
(with very minor differences -- a couple of new helper functions to
use).

That will be great if you can convert your fs. I will really appreciate
it.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 5/5] ext2: convert to use the new truncate convention.
  2009-08-17  6:42     ` Nick Piggin
@ 2009-08-17 11:09       ` Boaz Harrosh
  2009-08-17 16:44         ` Nick Piggin
  0 siblings, 1 reply; 28+ messages in thread
From: Boaz Harrosh @ 2009-08-17 11:09 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Hellwig, Andrew Morton, linux-fsdevel, linux-ext4,
	Christoph Hellwig

On 08/17/2009 09:42 AM, Nick Piggin wrote:
> On Sun, Aug 16, 2009 at 04:16:26PM -0400, Christoph Hellwig wrote:
>> On Sun, Aug 16, 2009 at 08:25:38PM +1000, npiggin@suse.de wrote:
>>> @@ -66,9 +68,10 @@ void ext2_delete_inode (struct inode * i
>>>  	mark_inode_dirty(inode);
>>>  	ext2_write_inode(inode, inode_needs_sync(inode));
>>>  
>>> +	/* XXX: where is truncate_inode_pages? */
>>>  	inode->i_size = 0;
>>>  	if (inode->i_blocks)
>>> -		ext2_truncate (inode);
>>> +		ext2_truncate_blocks(inode, 0);
>>>  	ext2_free_inode (inode);
>>
>> At the beginning of the function, just before the diff window.  Because
>> this is ->delete_inode we truncate away all pages, down to offset 0.
> 
> OK, weird. I thought I couldn't see it when I wrote that :) maybe my
> tree was corrupted or I'm stupid.
> 
> 
>>> +static void ext2_truncate_blocks(struct inode *inode, loff_t offset)
>>> +{
>>> +	/*
>>> +	 * XXX: it seems like a bug here that we don't allow
>>> +	 * IS_APPEND inode to have blocks-past-i_size trimmed off.
>>> +	 * review and fix this.
>>> +	 */
>>> +	if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
>>> +	    S_ISLNK(inode->i_mode)))
>>> +		return -EINVAL;
>>> +	if (ext2_inode_is_fast_symlink(inode))
>>> +		return -EINVAL;
>>> +	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
>>> +		return -EPERM;
>>> +	__ext2_truncate_blocks(inode, offset);
>>
>> Yes, I think the IS_APPEND(inode) || IS_IMMUTABLE(inode) checks should move
>> into ext2_setsize.  But let's leave that for a separate patch.
> 
> Yeah agreed.
> 
>  
>> Btw, the above code gives me warnings like this:
>>
>> /home/hch/work/linux-2.6/fs/ext2/inode.c: In function
>> 'ext2_truncate_blocks':
>> /home/hch/work/linux-2.6/fs/ext2/inode.c:1158: warning: 'return' with a
>> value, in function returning void
>> /home/hch/work/linux-2.6/fs/ext2/inode.c:1160: warning: 'return' with a
>> value, in function returning void
>> /home/hch/work/linux-2.6/fs/ext2/inode.c:1162: warning: 'return' with a
>> value, in function returning void
>>
>> because you try to return errors from a function delcared as void.
> 
> Hm, sorry. I thought it was in good shape... I'll recheck that I sent
> out the correct versions and resend according to feedback from you
> and Hugh.
> 
> Thanks,
> Nick
> 

Nick do you have a public tree with these latest patches? I would like to
try out an exofs conversion and testing.

Thanks
Boaz

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 5/5] ext2: convert to use the new truncate convention.
  2009-08-16 20:16   ` Christoph Hellwig
@ 2009-08-17  6:42     ` Nick Piggin
  2009-08-17 11:09       ` Boaz Harrosh
  0 siblings, 1 reply; 28+ messages in thread
From: Nick Piggin @ 2009-08-17  6:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, linux-fsdevel, linux-ext4, Christoph Hellwig

On Sun, Aug 16, 2009 at 04:16:26PM -0400, Christoph Hellwig wrote:
> On Sun, Aug 16, 2009 at 08:25:38PM +1000, npiggin@suse.de wrote:
> > @@ -66,9 +68,10 @@ void ext2_delete_inode (struct inode * i
> >  	mark_inode_dirty(inode);
> >  	ext2_write_inode(inode, inode_needs_sync(inode));
> >  
> > +	/* XXX: where is truncate_inode_pages? */
> >  	inode->i_size = 0;
> >  	if (inode->i_blocks)
> > -		ext2_truncate (inode);
> > +		ext2_truncate_blocks(inode, 0);
> >  	ext2_free_inode (inode);
> 
> At the beginning of the function, just before the diff window.  Because
> this is ->delete_inode we truncate away all pages, down to offset 0.

OK, weird. I thought I couldn't see it when I wrote that :) maybe my
tree was corrupted or I'm stupid.


> > +static void ext2_truncate_blocks(struct inode *inode, loff_t offset)
> > +{
> > +	/*
> > +	 * XXX: it seems like a bug here that we don't allow
> > +	 * IS_APPEND inode to have blocks-past-i_size trimmed off.
> > +	 * review and fix this.
> > +	 */
> > +	if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
> > +	    S_ISLNK(inode->i_mode)))
> > +		return -EINVAL;
> > +	if (ext2_inode_is_fast_symlink(inode))
> > +		return -EINVAL;
> > +	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
> > +		return -EPERM;
> > +	__ext2_truncate_blocks(inode, offset);
> 
> Yes, I think the IS_APPEND(inode) || IS_IMMUTABLE(inode) checks should move
> into ext2_setsize.  But let's leave that for a separate patch.

Yeah agreed.

 
> Btw, the above code gives me warnings like this:
> 
> /home/hch/work/linux-2.6/fs/ext2/inode.c: In function
> 'ext2_truncate_blocks':
> /home/hch/work/linux-2.6/fs/ext2/inode.c:1158: warning: 'return' with a
> value, in function returning void
> /home/hch/work/linux-2.6/fs/ext2/inode.c:1160: warning: 'return' with a
> value, in function returning void
> /home/hch/work/linux-2.6/fs/ext2/inode.c:1162: warning: 'return' with a
> value, in function returning void
> 
> because you try to return errors from a function delcared as void.

Hm, sorry. I thought it was in good shape... I'll recheck that I sent
out the correct versions and resend according to feedback from you
and Hugh.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 5/5] ext2: convert to use the new truncate convention.
  2009-08-16 10:25 ` [patch 5/5] ext2: convert to use the new truncate convention npiggin
@ 2009-08-16 20:16   ` Christoph Hellwig
  2009-08-17  6:42     ` Nick Piggin
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2009-08-16 20:16 UTC (permalink / raw)
  To: npiggin; +Cc: Andrew Morton, linux-fsdevel, linux-ext4, Christoph Hellwig

On Sun, Aug 16, 2009 at 08:25:38PM +1000, npiggin@suse.de wrote:
> @@ -66,9 +68,10 @@ void ext2_delete_inode (struct inode * i
>  	mark_inode_dirty(inode);
>  	ext2_write_inode(inode, inode_needs_sync(inode));
>  
> +	/* XXX: where is truncate_inode_pages? */
>  	inode->i_size = 0;
>  	if (inode->i_blocks)
> -		ext2_truncate (inode);
> +		ext2_truncate_blocks(inode, 0);
>  	ext2_free_inode (inode);

At the beginning of the function, just before the diff window.  Because
this is ->delete_inode we truncate away all pages, down to offset 0.

> +static void ext2_truncate_blocks(struct inode *inode, loff_t offset)
> +{
> +	/*
> +	 * XXX: it seems like a bug here that we don't allow
> +	 * IS_APPEND inode to have blocks-past-i_size trimmed off.
> +	 * review and fix this.
> +	 */
> +	if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
> +	    S_ISLNK(inode->i_mode)))
> +		return -EINVAL;
> +	if (ext2_inode_is_fast_symlink(inode))
> +		return -EINVAL;
> +	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
> +		return -EPERM;
> +	__ext2_truncate_blocks(inode, offset);

Yes, I think the IS_APPEND(inode) || IS_IMMUTABLE(inode) checks should move
into ext2_setsize.  But let's leave that for a separate patch.

Btw, the above code gives me warnings like this:

/home/hch/work/linux-2.6/fs/ext2/inode.c: In function
'ext2_truncate_blocks':
/home/hch/work/linux-2.6/fs/ext2/inode.c:1158: warning: 'return' with a
value, in function returning void
/home/hch/work/linux-2.6/fs/ext2/inode.c:1160: warning: 'return' with a
value, in function returning void
/home/hch/work/linux-2.6/fs/ext2/inode.c:1162: warning: 'return' with a
value, in function returning void

because you try to return errors from a function delcared as void.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 5/5] ext2: convert to use the new truncate convention.
  2009-08-16 10:25 [patch 0/5] new truncate sequence npiggin
@ 2009-08-16 10:25 ` npiggin
  2009-08-16 20:16   ` Christoph Hellwig
  0 siblings, 1 reply; 28+ messages in thread
From: npiggin @ 2009-08-16 10:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, linux-ext4, Christoph Hellwig

[-- Attachment #1: ext2-new-truncate.patch --]
[-- Type: text/plain, Size: 8258 bytes --]

I also have some questions, marked with XXX.

Cc: linux-ext4@vger.kernel.org
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 fs/ext2/ext2.h  |    1 
 fs/ext2/file.c  |    2 
 fs/ext2/inode.c |  138 +++++++++++++++++++++++++++++++++++++++++++++-----------
 3 files changed, 113 insertions(+), 28 deletions(-)

Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c
+++ linux-2.6/fs/ext2/inode.c
@@ -53,6 +53,8 @@ static inline int ext2_inode_is_fast_sym
 		inode->i_blocks - ea_blocks == 0);
 }
 
+static void ext2_truncate_blocks(struct inode *inode, loff_t offset);
+
 /*
  * Called at the last iput() if i_nlink is zero.
  */
@@ -66,9 +68,10 @@ void ext2_delete_inode (struct inode * i
 	mark_inode_dirty(inode);
 	ext2_write_inode(inode, inode_needs_sync(inode));
 
+	/* XXX: where is truncate_inode_pages? */
 	inode->i_size = 0;
 	if (inode->i_blocks)
-		ext2_truncate (inode);
+		ext2_truncate_blocks(inode, 0);
 	ext2_free_inode (inode);
 
 	return;
@@ -761,8 +764,33 @@ ext2_write_begin(struct file *file, stru
 		loff_t pos, unsigned len, unsigned flags,
 		struct page **pagep, void **fsdata)
 {
+	int ret;
+
 	*pagep = NULL;
-	return __ext2_write_begin(file, mapping, pos, len, flags, pagep,fsdata);
+	ret = __ext2_write_begin(file, mapping, pos, len, flags, pagep,fsdata);
+	if (ret < 0) {
+		struct inode *inode = mapping->host;
+		loff_t isize = inode->i_size;
+		if (pos + len > isize)
+			ext2_truncate_blocks(inode, isize);
+	}
+	return ret;
+}
+
+static int ext2_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied,
+			struct page *page, void *fsdata)
+{
+	int ret;
+
+	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
+	if (ret < len) {
+		struct inode *inode = mapping->host;
+		loff_t isize = inode->i_size;
+		if (pos + len > isize)
+			ext2_truncate_blocks(inode, isize);
+	}
+	return ret;
 }
 
 static int
@@ -770,13 +798,22 @@ ext2_nobh_write_begin(struct file *file,
 		loff_t pos, unsigned len, unsigned flags,
 		struct page **pagep, void **fsdata)
 {
+	int ret;
+
 	/*
 	 * Dir-in-pagecache still uses ext2_write_begin. Would have to rework
 	 * directory handling code to pass around offsets rather than struct
 	 * pages in order to make this work easily.
 	 */
-	return nobh_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+	ret = nobh_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
 							ext2_get_block);
+	if (ret < 0) {
+		struct inode *inode = mapping->host;
+		loff_t isize = inode->i_size;
+		if (pos + len > isize)
+			ext2_truncate_blocks(inode, isize);
+	}
+	return ret;
 }
 
 static int ext2_nobh_writepage(struct page *page,
@@ -796,9 +833,15 @@ ext2_direct_IO(int rw, struct kiocb *ioc
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
+	ssize_t ret;
 
-	return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
+	ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
 				offset, nr_segs, ext2_get_block, NULL);
+	if (ret < 0 && (rw & WRITE)) {
+		loff_t isize = inode->i_size;
+		ext2_truncate_blocks(inode, isize);
+	}
+	return ret;
 }
 
 static int
@@ -813,7 +856,7 @@ const struct address_space_operations ex
 	.writepage		= ext2_writepage,
 	.sync_page		= block_sync_page,
 	.write_begin		= ext2_write_begin,
-	.write_end		= generic_write_end,
+	.write_end		= ext2_write_end,
 	.bmap			= ext2_bmap,
 	.direct_IO		= ext2_direct_IO,
 	.writepages		= ext2_writepages,
@@ -1020,7 +1063,7 @@ static void ext2_free_branches(struct in
 		ext2_free_data(inode, p, q);
 }
 
-void ext2_truncate(struct inode *inode)
+static void __ext2_truncate_blocks(struct inode *inode, loff_t offset)
 {
 	__le32 *i_data = EXT2_I(inode)->i_data;
 	struct ext2_inode_info *ei = EXT2_I(inode);
@@ -1032,27 +1075,8 @@ void ext2_truncate(struct inode *inode)
 	int n;
 	long iblock;
 	unsigned blocksize;
-
-	if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
-	    S_ISLNK(inode->i_mode)))
-		return;
-	if (ext2_inode_is_fast_symlink(inode))
-		return;
-	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
-		return;
-
 	blocksize = inode->i_sb->s_blocksize;
-	iblock = (inode->i_size + blocksize-1)
-					>> EXT2_BLOCK_SIZE_BITS(inode->i_sb);
-
-	if (mapping_is_xip(inode->i_mapping))
-		xip_truncate_page(inode->i_mapping, inode->i_size);
-	else if (test_opt(inode->i_sb, NOBH))
-		nobh_truncate_page(inode->i_mapping,
-				inode->i_size, ext2_get_block);
-	else
-		block_truncate_page(inode->i_mapping,
-				inode->i_size, ext2_get_block);
+	iblock = (offset + blocksize-1) >> EXT2_BLOCK_SIZE_BITS(inode->i_sb);
 
 	n = ext2_block_to_path(inode, iblock, offsets, NULL);
 	if (n == 0)
@@ -1120,6 +1144,59 @@ do_indirects:
 	ext2_discard_reservation(inode);
 
 	mutex_unlock(&ei->truncate_mutex);
+}
+
+static void ext2_truncate_blocks(struct inode *inode, loff_t offset)
+{
+	/*
+	 * XXX: it seems like a bug here that we don't allow
+	 * IS_APPEND inode to have blocks-past-i_size trimmed off.
+	 * review and fix this.
+	 */
+	if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
+	    S_ISLNK(inode->i_mode)))
+		return -EINVAL;
+	if (ext2_inode_is_fast_symlink(inode))
+		return -EINVAL;
+	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+		return -EPERM;
+	__ext2_truncate_blocks(inode, offset);
+}
+
+int ext2_setsize(struct inode *inode, loff_t newsize)
+{
+	loff_t oldsize;
+	int error;
+
+	error = inode_newsize_ok(inode, newsize);
+	if (error)
+		return error;
+
+	if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
+	    S_ISLNK(inode->i_mode)))
+		return -EINVAL;
+	if (ext2_inode_is_fast_symlink(inode))
+		return -EINVAL;
+	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+		return -EPERM;
+
+	if (mapping_is_xip(inode->i_mapping))
+		error = xip_truncate_page(inode->i_mapping, newsize);
+	else if (test_opt(inode->i_sb, NOBH))
+		error = nobh_truncate_page(inode->i_mapping,
+				newsize, ext2_get_block);
+	else
+		error = block_truncate_page(inode->i_mapping,
+				newsize, ext2_get_block);
+	if (error)
+		return error;
+
+	oldsize = inode->i_size;
+	i_size_write(inode, newsize);
+	truncate_pagecache(inode, oldsize, newsize);
+
+	__ext2_truncate_blocks(inode, newsize);
+
 	inode->i_mtime = inode->i_ctime = CURRENT_TIME_SEC;
 	if (inode_needs_sync(inode)) {
 		sync_mapping_buffers(inode->i_mapping);
@@ -1127,6 +1204,8 @@ do_indirects:
 	} else {
 		mark_inode_dirty(inode);
 	}
+
+	return 0;
 }
 
 static struct ext2_inode *ext2_get_inode(struct super_block *sb, ino_t ino,
@@ -1459,6 +1538,13 @@ int ext2_setattr(struct dentry *dentry,
 		if (error)
 			return error;
 	}
+	if (iattr->ia_valid & ATTR_SIZE) {
+		error = ext2_setsize(inode, iattr->ia_size);
+		if (error)
+			return error;
+		iattr->ia_valid &= ~ATTR_SIZE;
+		/* strip ATTR_SIZE for inode_setattr */
+	}
 	error = inode_setattr(inode, iattr);
 	if (!error && (iattr->ia_valid & ATTR_MODE))
 		error = ext2_acl_chmod(inode);
Index: linux-2.6/fs/ext2/ext2.h
===================================================================
--- linux-2.6.orig/fs/ext2/ext2.h
+++ linux-2.6/fs/ext2/ext2.h
@@ -122,7 +122,6 @@ extern int ext2_write_inode (struct inod
 extern void ext2_delete_inode (struct inode *);
 extern int ext2_sync_inode (struct inode *);
 extern int ext2_get_block(struct inode *, sector_t, struct buffer_head *, int);
-extern void ext2_truncate (struct inode *);
 extern int ext2_setattr (struct dentry *, struct iattr *);
 extern void ext2_set_inode_flags(struct inode *inode);
 extern void ext2_get_inode_flags(struct ext2_inode_info *);
Index: linux-2.6/fs/ext2/file.c
===================================================================
--- linux-2.6.orig/fs/ext2/file.c
+++ linux-2.6/fs/ext2/file.c
@@ -77,13 +77,13 @@ const struct file_operations ext2_xip_fi
 #endif
 
 const struct inode_operations ext2_file_inode_operations = {
-	.truncate	= ext2_truncate,
 #ifdef CONFIG_EXT2_FS_XATTR
 	.setxattr	= generic_setxattr,
 	.getxattr	= generic_getxattr,
 	.listxattr	= ext2_listxattr,
 	.removexattr	= generic_removexattr,
 #endif
+	.new_truncate	= 1,
 	.setattr	= ext2_setattr,
 	.permission	= ext2_permission,
 	.fiemap		= ext2_fiemap,



^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2009-09-02 11:14 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-07-10  7:30 [patch 0/5] new truncate sequence patchset npiggin
2009-07-10  7:30 ` [patch 1/5] fs: new truncate helpers npiggin
2009-07-10  7:30 ` [patch 2/5] fs: use " npiggin-l3A5Bk7waGM
2009-07-10  7:30   ` npiggin
2009-07-10  7:30 ` [patch 3/5] fs: introduce new truncate sequence npiggin
2009-07-10  7:30 ` [patch 4/5] tmpfs: convert to use the new truncate convention npiggin
2009-07-10  7:30 ` [patch 5/5] ext2: " npiggin
2009-09-01 18:29   ` Jan Kara
2009-09-02  9:14     ` Nick Piggin
2009-09-02 11:14       ` Jan Kara
2009-07-10  9:33 ` [patch 1/3] fs: buffer_head writepage no invalidate Nick Piggin
2009-07-10  9:33   ` Nick Piggin
2009-07-10  9:34   ` [patch 2/3] fs: buffer_head writepage no zero Nick Piggin
2009-07-10  9:34     ` Nick Piggin
2009-07-10 11:46     ` Jan Kara
2009-07-10 11:46       ` Jan Kara
2009-07-13  6:54       ` Nick Piggin
2009-07-13  6:54         ` Nick Piggin
2009-07-10  9:35   ` [patch 3/3] fs: buffer_head page_lock i_size relax Nick Piggin
2009-07-10  9:35     ` Nick Piggin
2009-07-10 11:08   ` [patch 1/3] fs: buffer_head writepage no invalidate Jan Kara
2009-07-10 11:08     ` Jan Kara
2009-07-10 14:31 ` [patch 0/5] new truncate sequence patchset Christoph Hellwig
2009-08-16 10:25 [patch 0/5] new truncate sequence npiggin
2009-08-16 10:25 ` [patch 5/5] ext2: convert to use the new truncate convention npiggin
2009-08-16 20:16   ` Christoph Hellwig
2009-08-17  6:42     ` Nick Piggin
2009-08-17 11:09       ` Boaz Harrosh
2009-08-17 16:44         ` Nick Piggin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.