linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3)
@ 2009-06-15 17:59 Jan Kara
  2009-06-15 17:59 ` [PATCH 01/11] ext3: Get rid of extenddisksize parameter of ext3_get_blocks_handle() Jan Kara
                   ` (12 more replies)
  0 siblings, 13 replies; 39+ messages in thread
From: Jan Kara @ 2009-06-15 17:59 UTC (permalink / raw)
  To: LKML; +Cc: linux-mm, linux-fsdevel, npiggin


patches below are an attempt to solve problems filesystems have with
page_mkwrite() when blocksize < pagesize (see the changelog of the second patch
for details).

Could someone please review them so that they can get merged - especially the
generic VFS/MM part? It fixes observed problems (WARN_ON triggers) for ext4 and
makes ext2/ext3 behave more nicely (mmapped write getting page fault instead
of silently discarding data).

The series is against Linus's tree from today. The differences against previous
version are one bugfix in ext3 delalloc implementation... Please test and review.
Thanks.

									Honza

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 01/11] ext3: Get rid of extenddisksize parameter of ext3_get_blocks_handle()
  2009-06-15 17:59 [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3) Jan Kara
@ 2009-06-15 17:59 ` Jan Kara
  2009-06-17 10:28   ` Nick Piggin
  2009-06-15 17:59 ` [PATCH 02/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize Jan Kara
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 39+ messages in thread
From: Jan Kara @ 2009-06-15 17:59 UTC (permalink / raw)
  To: LKML; +Cc: linux-mm, linux-fsdevel, npiggin, Jan Kara

Get rid of extenddisksize parameter of ext3_get_blocks_handle(). This seems to
be a relict from some old days and setting disksize in this function does not
make much sence. Currently it was set only by ext3_getblk().  Since the
parameter has some effect only if create == 1, it is easy to check that the
three callers which end up calling ext3_getblk() with create == 1 (ext3_append,
ext3_quota_write, ext3_mkdir) do the right thing and set disksize themselves.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext3/dir.c           |    3 +--
 fs/ext3/inode.c         |   13 +++----------
 include/linux/ext3_fs.h |    2 +-
 3 files changed, 5 insertions(+), 13 deletions(-)

diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index 3d724a9..373fa90 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -130,8 +130,7 @@ static int ext3_readdir(struct file * filp,
 		struct buffer_head *bh = NULL;
 
 		map_bh.b_state = 0;
-		err = ext3_get_blocks_handle(NULL, inode, blk, 1,
-						&map_bh, 0, 0);
+		err = ext3_get_blocks_handle(NULL, inode, blk, 1, &map_bh, 0);
 		if (err > 0) {
 			pgoff_t index = map_bh.b_blocknr >>
 					(PAGE_CACHE_SHIFT - inode->i_blkbits);
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index b0248c6..2d4b2ca 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -788,7 +788,7 @@ err_out:
 int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
 		sector_t iblock, unsigned long maxblocks,
 		struct buffer_head *bh_result,
-		int create, int extend_disksize)
+		int create)
 {
 	int err = -EIO;
 	int offsets[4];
@@ -911,13 +911,6 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
 	if (!err)
 		err = ext3_splice_branch(handle, inode, iblock,
 					partial, indirect_blks, count);
-	/*
-	 * i_disksize growing is protected by truncate_mutex.  Don't forget to
-	 * protect it if you're about to implement concurrent
-	 * ext3_get_block() -bzzz
-	*/
-	if (!err && extend_disksize && inode->i_size > ei->i_disksize)
-		ei->i_disksize = inode->i_size;
 	mutex_unlock(&ei->truncate_mutex);
 	if (err)
 		goto cleanup;
@@ -972,7 +965,7 @@ static int ext3_get_block(struct inode *inode, sector_t iblock,
 	}
 
 	ret = ext3_get_blocks_handle(handle, inode, iblock,
-					max_blocks, bh_result, create, 0);
+					max_blocks, bh_result, create);
 	if (ret > 0) {
 		bh_result->b_size = (ret << inode->i_blkbits);
 		ret = 0;
@@ -1005,7 +998,7 @@ struct buffer_head *ext3_getblk(handle_t *handle, struct inode *inode,
 	dummy.b_blocknr = -1000;
 	buffer_trace_init(&dummy.b_history);
 	err = ext3_get_blocks_handle(handle, inode, block, 1,
-					&dummy, create, 1);
+					&dummy, create);
 	/*
 	 * ext3_get_blocks_handle() returns number of blocks
 	 * mapped. 0 in case of a HOLE.
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 634a5e5..7499b36 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -874,7 +874,7 @@ struct buffer_head * ext3_getblk (handle_t *, struct inode *, long, int, int *);
 struct buffer_head * ext3_bread (handle_t *, struct inode *, int, int, int *);
 int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
 	sector_t iblock, unsigned long maxblocks, struct buffer_head *bh_result,
-	int create, int extend_disksize);
+	int create);
 
 extern struct inode *ext3_iget(struct super_block *, unsigned long);
 extern int  ext3_write_inode (struct inode *, int);
-- 
1.6.0.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 02/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize
  2009-06-15 17:59 [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3) Jan Kara
  2009-06-15 17:59 ` [PATCH 01/11] ext3: Get rid of extenddisksize parameter of ext3_get_blocks_handle() Jan Kara
@ 2009-06-15 17:59 ` Jan Kara
  2009-06-25 16:17   ` Nick Piggin
  2009-06-15 17:59 ` [PATCH 03/11] ext2: Allocate space for mmaped file on page fault Jan Kara
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 39+ messages in thread
From: Jan Kara @ 2009-06-15 17:59 UTC (permalink / raw)
  To: LKML; +Cc: linux-mm, linux-fsdevel, npiggin, Jan Kara

page_mkwrite() is meant to be used by filesystems to allocate blocks under a
page which is becoming writeably mmapped in some process address space. This
allows a filesystem to return a page fault if there is not enough space
available, user exceeds quota or similar problem happens, rather than silently
discarding data later when writepage is called.

On filesystems where blocksize < pagesize the situation is more complicated.
Think for example that blocksize = 1024, pagesize = 4096 and a process does:
  ftruncate(fd, 0);
  pwrite(fd, buf, 1024, 0);
  map = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
  map[0] = 'a';  ----> page_mkwrite() for index 0 is called
  ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
  fsync(fd); ----> writepage() for index 0 is called

At the moment page_mkwrite() is called, filesystem can allocate only one block
for the page because i_size == 1024. Otherwise it would create blocks beyond
i_size which is generally undesirable. But later at writepage() time, we would
like to have blocks allocated for the whole page (and in principle we have to
allocate them because user could have filled the page with data after the
second ftruncate()). This patch introduces a framework which allows filesystems
to handle this with a reasonable effort.

The idea is following: Before we extend i_size, we obtain a special lock blocking
page_mkwrite() on the page straddling i_size. Then we writeprotect the page,
change i_size and unlock the special lock. This way, page_mkwrite() is called for
a page each time a number of blocks needed to be allocated for a page increases.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/buffer.c                 |  143 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/buffer_head.h |    4 +
 include/linux/fs.h          |   11 +++-
 mm/filemap.c                |   10 +++-
 mm/memory.c                 |    2 +-
 5 files changed, 166 insertions(+), 4 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index a3ef091..80e2630 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -40,6 +40,7 @@
 #include <linux/cpu.h>
 #include <linux/bitops.h>
 #include <linux/mpage.h>
+#include <linux/rmap.h>
 #include <linux/bit_spinlock.h>
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
@@ -1970,9 +1971,11 @@ int block_write_begin(struct file *file, struct address_space *mapping,
 	page = *pagep;
 	if (page == NULL) {
 		ownpage = 1;
+		block_lock_hole_extend(inode, pos);
 		page = grab_cache_page_write_begin(mapping, index, flags);
 		if (!page) {
 			status = -ENOMEM;
+			block_unlock_hole_extend(inode);
 			goto out;
 		}
 		*pagep = page;
@@ -1987,6 +1990,7 @@ int block_write_begin(struct file *file, struct address_space *mapping,
 			unlock_page(page);
 			page_cache_release(page);
 			*pagep = NULL;
+			block_unlock_hole_extend(inode);
 
 			/*
 			 * prepare_write() may have instantiated a few blocks
@@ -2062,6 +2066,7 @@ int generic_write_end(struct file *file, struct address_space *mapping,
 
 	unlock_page(page);
 	page_cache_release(page);
+	block_unlock_hole_extend(inode);
 
 	/*
 	 * Don't mark the inode dirty under page lock. First, it unnecessarily
@@ -2368,6 +2373,137 @@ int block_commit_write(struct page *page, unsigned from, unsigned to)
 }
 
 /*
+ * Lock inode with I_HOLE_EXTEND if the write is going to create a hole
+ * under a mmapped page. Also mark the page RO so that page_mkwrite()
+ * is called on the nearest write access to the page and clear dirty bits
+ * beyond i_size.
+ *
+ * @pos is offset to which write/truncate is happenning.
+ *
+ * Returns 1 if the lock has been acquired.
+ */
+int block_lock_hole_extend(struct inode *inode, loff_t pos)
+{
+	int bsize = 1 << inode->i_blkbits;
+	loff_t rounded_i_size;
+	struct page *page;
+	pgoff_t index;
+	struct buffer_head *head, *bh;
+	sector_t block, last_block;
+
+	/* Optimize for common case */
+	if (PAGE_CACHE_SIZE == bsize)
+		return 0;
+	/* Currently last page will not have any hole block created? */
+	rounded_i_size = (inode->i_size + bsize - 1) & ~(bsize - 1);
+	if (pos <= rounded_i_size || !(rounded_i_size & (PAGE_CACHE_SIZE - 1)))
+		return 0;
+	/*
+	 * Check the mutex here so that we don't warn on things like blockdev
+	 * writes which have different locking rules...
+	 */
+	WARN_ON(!mutex_is_locked(&inode->i_mutex));
+	spin_lock(&inode_lock);
+	/*
+	 * From now on, block_page_mkwrite() will block on the page straddling
+	 * i_size. Note that the page on which it blocks changes with the
+	 * change of i_size but that is fine since when new i_size is written
+	 * blocks for the hole will be allocated.
+	 */
+	inode->i_state |= I_HOLE_EXTEND;
+	spin_unlock(&inode_lock);
+
+	/*
+	 * Make sure page_mkwrite() is called on this page before
+	 * user is able to write any data beyond current i_size via
+	 * mmap.
+	 *
+	 * See clear_page_dirty_for_io() for details why set_page_dirty()
+	 * is needed.
+	 */
+	index = inode->i_size >> PAGE_CACHE_SHIFT;
+	page = find_lock_page(inode->i_mapping, index);
+	if (!page)
+		return 1;
+	if (page_mkclean(page))
+		set_page_dirty(page);
+	/* Zero dirty bits beyond current i_size */
+	if (page_has_buffers(page)) {
+		bh = head = page_buffers(page);
+		last_block = (inode->i_size - 1) >> inode->i_blkbits;
+		block = index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+		do {
+			if (block > last_block)
+				clear_buffer_dirty(bh);
+			bh = bh->b_this_page;
+			block++;
+		} while (bh != head);
+	}
+	unlock_page(page);
+	page_cache_release(page);
+	return 1;
+}
+EXPORT_SYMBOL(block_lock_hole_extend);
+
+/* New i_size creating hole has been written, unlock the inode */
+void block_unlock_hole_extend(struct inode *inode)
+{
+	/*
+	 * We want to clear the flag we could have set previously. Noone else
+	 * can change the flag so lockless read is reliable.
+	 */
+	if (inode->i_state & I_HOLE_EXTEND) {
+		spin_lock(&inode_lock);
+		inode->i_state &= ~I_HOLE_EXTEND;
+		spin_unlock(&inode_lock);
+		/* Prevent speculative execution through spin_unlock */
+		smp_mb();
+		wake_up_bit(&inode->i_state, __I_HOLE_EXTEND);
+	}
+}
+EXPORT_SYMBOL(block_unlock_hole_extend);
+
+void block_extend_i_size(struct inode *inode, loff_t pos, loff_t len)
+{
+	int locked;
+
+	locked = block_lock_hole_extend(inode, pos + len);
+	i_size_write(inode, pos + len);
+	if (locked)
+		block_unlock_hole_extend(inode);
+}
+EXPORT_SYMBOL(block_extend_i_size);
+
+int block_wait_on_hole_extend(struct inode *inode, loff_t pos)
+{
+	loff_t size;
+	int ret = 0;
+
+restart:
+	size = i_size_read(inode);
+	if (pos > size)
+		return -EINVAL;
+	if (pos + PAGE_CACHE_SIZE < size)
+		return ret;
+	/*
+	 * This page contains EOF; make sure we see i_state from the moment
+	 * after page table modification
+	 */
+	smp_rmb();
+	if (inode->i_state & I_HOLE_EXTEND) {
+		wait_queue_head_t *wqh;
+		DEFINE_WAIT_BIT(wqb, &inode->i_state, __I_HOLE_EXTEND);
+
+		wqh = bit_waitqueue(&inode->i_state, __I_HOLE_EXTEND);
+		__wait_on_bit(wqh, &wqb, inode_wait, TASK_UNINTERRUPTIBLE);
+		ret = 1;
+		goto restart;
+	}
+	return ret;
+}
+EXPORT_SYMBOL(block_wait_on_hole_extend);
+
+/*
  * block_page_mkwrite() is not allowed to change the file size as it gets
  * called from a page fault handler when a page is first dirtied. Hence we must
  * be careful to check for EOF conditions here. We set the page up correctly
@@ -2392,6 +2528,13 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 	loff_t size;
 	int ret = VM_FAULT_NOPAGE; /* make the VM retry the fault */
 
+	block_wait_on_hole_extend(inode, page_offset(page));
+	/*
+	 * From this moment on a write creating a hole can happen
+	 * without us waiting for it. But because it writeprotects
+	 * the page, user cannot really write to the page until next
+	 * page_mkwrite() is called. And that one will wait.
+	 */
 	lock_page(page);
 	size = i_size_read(inode);
 	if ((page->mapping != inode->i_mapping) ||
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 16ed028..56a0162 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -219,6 +219,10 @@ int cont_write_begin(struct file *, struct address_space *, loff_t,
 			get_block_t *, loff_t *);
 int generic_cont_expand_simple(struct inode *inode, loff_t size);
 int block_commit_write(struct page *page, unsigned from, unsigned to);
+int block_lock_hole_extend(struct inode *inode, loff_t pos);
+void block_unlock_hole_extend(struct inode *inode);
+int block_wait_on_hole_extend(struct inode *inode, loff_t pos);
+void block_extend_i_size(struct inode *inode, loff_t pos, loff_t len);
 int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 				get_block_t get_block);
 void block_sync_page(struct page *);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ede84fa..6df7c84 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -580,7 +580,7 @@ struct address_space_operations {
 	int (*write_end)(struct file *, struct address_space *mapping,
 				loff_t pos, unsigned len, unsigned copied,
 				struct page *page, void *fsdata);
-
+	void (*extend_i_size)(struct inode *, loff_t pos, loff_t len);
 	/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
 	sector_t (*bmap)(struct address_space *, sector_t);
 	void (*invalidatepage) (struct page *, unsigned long);
@@ -597,6 +597,8 @@ struct address_space_operations {
 					unsigned long);
 };
 
+void do_extend_i_size(struct inode *inode, loff_t pos, loff_t len);
+
 /*
  * pagecache_write_begin/pagecache_write_end must be used by general code
  * to write into the pagecache.
@@ -1584,7 +1586,8 @@ struct super_operations {
  * until that flag is cleared.  I_WILL_FREE, I_FREEING and I_CLEAR are set at
  * various stages of removing an inode.
  *
- * Two bits are used for locking and completion notification, I_LOCK and I_SYNC.
+ * Three bits are used for locking and completion notification, I_LOCK,
+ * I_HOLE_EXTEND and I_SYNC.
  *
  * I_DIRTY_SYNC		Inode is dirty, but doesn't have to be written on
  *			fdatasync().  i_atime is the usual cause.
@@ -1622,6 +1625,8 @@ struct super_operations {
  *			of inode dirty data.  Having a separate lock for this
  *			purpose reduces latency and prevents some filesystem-
  *			specific deadlocks.
+ * I_HOLE_EXTEND	A lock synchronizing extension of a file which creates
+ *			a hole under a mmapped page with page_mkwrite().
  *
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
  * Q: igrab() only checks on (I_FREEING|I_WILL_FREE).  Should it also check on
@@ -1638,6 +1643,8 @@ struct super_operations {
 #define I_LOCK			(1 << __I_LOCK)
 #define __I_SYNC		8
 #define I_SYNC			(1 << __I_SYNC)
+#define __I_HOLE_EXTEND		9
+#define I_HOLE_EXTEND		(1 << __I_HOLE_EXTEND)
 
 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 1b60f30..5e38d7b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2079,6 +2079,14 @@ int pagecache_write_end(struct file *file, struct address_space *mapping,
 }
 EXPORT_SYMBOL(pagecache_write_end);
 
+void do_extend_i_size(struct inode *inode, loff_t pos, loff_t len)
+{
+	if (inode->i_mapping->a_ops->extend_i_size)
+		inode->i_mapping->a_ops->extend_i_size(inode, pos, len);
+	else
+		i_size_write(inode, pos + len);
+}
+
 ssize_t
 generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
 		unsigned long *nr_segs, loff_t pos, loff_t *ppos,
@@ -2139,7 +2147,7 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
 	if (written > 0) {
 		loff_t end = pos + written;
 		if (end > i_size_read(inode) && !S_ISBLK(inode->i_mode)) {
-			i_size_write(inode,  end);
+			do_extend_i_size(inode, pos, written);
 			mark_inode_dirty(inode);
 		}
 		*ppos = end;
diff --git a/mm/memory.c b/mm/memory.c
index 4126dd1..535183d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2377,7 +2377,7 @@ int vmtruncate(struct inode * inode, loff_t offset)
 			goto out_sig;
 		if (offset > inode->i_sb->s_maxbytes)
 			goto out_big;
-		i_size_write(inode, offset);
+		do_extend_i_size(inode, offset, 0);
 	} else {
 		struct address_space *mapping = inode->i_mapping;
 
-- 
1.6.0.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 03/11] ext2: Allocate space for mmaped file on page fault
  2009-06-15 17:59 [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3) Jan Kara
  2009-06-15 17:59 ` [PATCH 01/11] ext3: Get rid of extenddisksize parameter of ext3_get_blocks_handle() Jan Kara
  2009-06-15 17:59 ` [PATCH 02/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize Jan Kara
@ 2009-06-15 17:59 ` Jan Kara
  2009-06-15 17:59 ` [PATCH 04/11] ext4: Make sure blocks are properly allocated under mmaped page even when blocksize < pagesize Jan Kara
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2009-06-15 17:59 UTC (permalink / raw)
  To: LKML; +Cc: linux-mm, linux-fsdevel, npiggin, Jan Kara

So far we've allocated space at ->writepage() time. This has the disadvantage
that when we hit ENOSPC or other error, we cannot do much - either throw
away the data or keep the page indefinitely (and loose the data on reboot).
So allocate space already when a page is faulted in.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext2/file.c  |   26 +++++++++++++++++++++++++-
 fs/ext2/inode.c |    1 +
 2 files changed, 26 insertions(+), 1 deletions(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 2b9e47d..d0a5f13 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -19,6 +19,8 @@
  */
 
 #include <linux/time.h>
+#include <linux/mm.h>
+#include <linux/buffer_head.h>
 #include "ext2.h"
 #include "xattr.h"
 #include "acl.h"
@@ -38,6 +40,28 @@ static int ext2_release_file (struct inode * inode, struct file * filp)
 	return 0;
 }
 
+static int ext2_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return block_page_mkwrite(vma, vmf, ext2_get_block);
+}
+
+static struct vm_operations_struct ext2_file_vm_ops = {
+	.fault		= filemap_fault,
+	.page_mkwrite	= ext2_page_mkwrite,
+};
+
+static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct address_space *mapping = file->f_mapping;
+
+	if (!mapping->a_ops->readpage)
+		return -ENOEXEC;
+	file_accessed(file);
+	vma->vm_ops = &ext2_file_vm_ops;
+	vma->vm_flags |= VM_CAN_NONLINEAR;
+	return 0;
+}
+
 /*
  * We have mostly NULL's here: the current defaults are ok for
  * the ext2 filesystem.
@@ -52,7 +76,7 @@ const struct file_operations ext2_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ext2_compat_ioctl,
 #endif
-	.mmap		= generic_file_mmap,
+	.mmap		= ext2_file_mmap,
 	.open		= generic_file_open,
 	.release	= ext2_release_file,
 	.fsync		= simple_fsync,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 29ed682..3805b6b 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -814,6 +814,7 @@ const struct address_space_operations ext2_aops = {
 	.sync_page		= block_sync_page,
 	.write_begin		= ext2_write_begin,
 	.write_end		= generic_write_end,
+	.extend_i_size		= block_extend_i_size,
 	.bmap			= ext2_bmap,
 	.direct_IO		= ext2_direct_IO,
 	.writepages		= ext2_writepages,
-- 
1.6.0.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 04/11] ext4: Make sure blocks are properly allocated under mmaped page even when blocksize < pagesize
  2009-06-15 17:59 [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3) Jan Kara
                   ` (2 preceding siblings ...)
  2009-06-15 17:59 ` [PATCH 03/11] ext2: Allocate space for mmaped file on page fault Jan Kara
@ 2009-06-15 17:59 ` Jan Kara
  2009-06-15 17:59 ` [PATCH 05/11] ext3: Allocate space for mmaped file on page fault Jan Kara
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2009-06-15 17:59 UTC (permalink / raw)
  To: LKML; +Cc: linux-mm, linux-fsdevel, npiggin, Jan Kara

In a situation like:
  truncate(f, 1024);
  a = mmap(f, 0, 4096);
  a[0] = 'a';
  truncate(f, 4096);

we end up with a dirty page which does not have all blocks allocated /
reserved.  Fix the problem by using new VFS infrastructure.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/extents.c |    2 +-
 fs/ext4/inode.c   |   20 ++++++++++++++++++--
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 2593f74..764c394 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3073,7 +3073,7 @@ static void ext4_falloc_update_inode(struct inode *inode,
 	 */
 	if (!(mode & FALLOC_FL_KEEP_SIZE)) {
 		if (new_size > i_size_read(inode))
-			i_size_write(inode, new_size);
+			block_extend_i_size(inode, new_size, 0);
 		if (new_size > EXT4_I(inode)->i_disksize)
 			ext4_update_i_disksize(inode, new_size);
 	}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 875db94..3cad61b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1478,7 +1478,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
  	index = pos >> PAGE_CACHE_SHIFT;
 	from = pos & (PAGE_CACHE_SIZE - 1);
 	to = from + len;
-
+	block_lock_hole_extend(inode, pos);
 retry:
 	handle = ext4_journal_start(inode, needed_blocks);
 	if (IS_ERR(handle)) {
@@ -1537,6 +1537,8 @@ retry:
 	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
 		goto retry;
 out:
+	if (ret)
+		block_unlock_hole_extend(inode);
 	return ret;
 }
 
@@ -1735,6 +1737,7 @@ static int ext4_journalled_write_end(struct file *file,
 
 	unlock_page(page);
 	page_cache_release(page);
+	block_unlock_hole_extend(inode);
 	if (pos + len > inode->i_size)
 		/* if we have allocated more blocks and copied
 		 * less. We will have blocks allocated outside
@@ -2909,6 +2912,7 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
 		   "dev %s ino %lu pos %llu len %u flags %u",
 		   inode->i_sb->s_id, inode->i_ino,
 		   (unsigned long long) pos, len, flags);
+	block_lock_hole_extend(inode, pos);
 retry:
 	/*
 	 * With delayed allocation, we don't log the i_disksize update
@@ -2951,6 +2955,8 @@ retry:
 	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
 		goto retry;
 out:
+	if (ret)
+		block_unlock_hole_extend(inode);
 	return ret;
 }
 
@@ -3496,7 +3502,7 @@ static ssize_t ext4_direct_IO(int rw, struct kiocb *iocb,
 			loff_t end = offset + ret;
 			if (end > inode->i_size) {
 				ei->i_disksize = end;
-				i_size_write(inode, end);
+				block_extend_i_size(inode, offset, ret);
 				/*
 				 * We're going to return a positive `ret'
 				 * here due to non-zero-length I/O, so there's
@@ -3541,6 +3547,7 @@ static const struct address_space_operations ext4_ordered_aops = {
 	.sync_page		= block_sync_page,
 	.write_begin		= ext4_write_begin,
 	.write_end		= ext4_ordered_write_end,
+	.extend_i_size		= block_extend_i_size,
 	.bmap			= ext4_bmap,
 	.invalidatepage		= ext4_invalidatepage,
 	.releasepage		= ext4_releasepage,
@@ -3556,6 +3563,7 @@ static const struct address_space_operations ext4_writeback_aops = {
 	.sync_page		= block_sync_page,
 	.write_begin		= ext4_write_begin,
 	.write_end		= ext4_writeback_write_end,
+	.extend_i_size		= block_extend_i_size,
 	.bmap			= ext4_bmap,
 	.invalidatepage		= ext4_invalidatepage,
 	.releasepage		= ext4_releasepage,
@@ -3571,6 +3579,7 @@ static const struct address_space_operations ext4_journalled_aops = {
 	.sync_page		= block_sync_page,
 	.write_begin		= ext4_write_begin,
 	.write_end		= ext4_journalled_write_end,
+	.extend_i_size		= block_extend_i_size,
 	.set_page_dirty		= ext4_journalled_set_page_dirty,
 	.bmap			= ext4_bmap,
 	.invalidatepage		= ext4_invalidatepage,
@@ -3586,6 +3595,7 @@ static const struct address_space_operations ext4_da_aops = {
 	.sync_page		= block_sync_page,
 	.write_begin		= ext4_da_write_begin,
 	.write_end		= ext4_da_write_end,
+	.extend_i_size		= block_extend_i_size,
 	.bmap			= ext4_bmap,
 	.invalidatepage		= ext4_da_invalidatepage,
 	.releasepage		= ext4_releasepage,
@@ -5433,6 +5443,12 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	struct address_space *mapping = inode->i_mapping;
 
 	/*
+	 * Wait for extending of i_size, after this moment, next truncate /
+	 * write can create holes under us but they writeprotect our page so
+	 * we'll be called again to fill the hole.
+	 */
+	block_wait_on_hole_extend(inode, page_offset(page));
+	/*
 	 * Get i_alloc_sem to stop truncates messing with the inode. We cannot
 	 * get i_mutex because we are already holding mmap_sem.
 	 */
-- 
1.6.0.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 05/11] ext3: Allocate space for mmaped file on page fault
  2009-06-15 17:59 [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3) Jan Kara
                   ` (3 preceding siblings ...)
  2009-06-15 17:59 ` [PATCH 04/11] ext4: Make sure blocks are properly allocated under mmaped page even when blocksize < pagesize Jan Kara
@ 2009-06-15 17:59 ` Jan Kara
  2009-06-15 17:59 ` [PATCH 06/11] vfs: Implement generic per-cpu counters for delayed allocation Jan Kara
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2009-06-15 17:59 UTC (permalink / raw)
  To: LKML; +Cc: linux-mm, linux-fsdevel, npiggin, Jan Kara

So far we've allocated space at ->writepage() time. This has the disadvantage
that when we hit ENOSPC or other error, we cannot do much - either throw
away the data or keep the page indefinitely (and loose the data on reboot).
So allocate space already when a page is faulted in.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext3/file.c          |   19 ++++-
 fs/ext3/inode.c         |  220 +++++++++++++++++++++--------------------------
 include/linux/ext3_fs.h |    1 +
 3 files changed, 116 insertions(+), 124 deletions(-)

diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index 5b49704..a7dce9d 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -110,6 +110,23 @@ force_commit:
 	return ret;
 }
 
+static struct vm_operations_struct ext3_file_vm_ops = {
+	.fault		= filemap_fault,
+	.page_mkwrite	= ext3_page_mkwrite,
+};
+
+static int ext3_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct address_space *mapping = file->f_mapping;
+
+	if (!mapping->a_ops->readpage)
+		return -ENOEXEC;
+	file_accessed(file);
+	vma->vm_ops = &ext3_file_vm_ops;
+	vma->vm_flags |= VM_CAN_NONLINEAR;
+	return 0;
+}
+
 const struct file_operations ext3_file_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= do_sync_read,
@@ -120,7 +137,7 @@ const struct file_operations ext3_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ext3_compat_ioctl,
 #endif
-	.mmap		= generic_file_mmap,
+	.mmap		= ext3_file_mmap,
 	.open		= generic_file_open,
 	.release	= ext3_release_file,
 	.fsync		= ext3_sync_file,
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 2d4b2ca..ec112b4 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1156,10 +1156,13 @@ static int ext3_write_begin(struct file *file, struct address_space *mapping,
 	from = pos & (PAGE_CACHE_SIZE - 1);
 	to = from + len;
 
+	block_lock_hole_extend(inode, pos);
 retry:
 	page = grab_cache_page_write_begin(mapping, index, flags);
-	if (!page)
-		return -ENOMEM;
+	if (!page) {
+		ret = -ENOMEM;
+		goto out;
+	}
 	*pagep = page;
 
 	handle = ext3_journal_start(inode, needed_blocks);
@@ -1199,6 +1202,8 @@ write_begin_failed:
 	if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
 		goto retry;
 out:
+	if (ret)
+		block_unlock_hole_extend(inode);
 	return ret;
 }
 
@@ -1290,6 +1295,7 @@ static int ext3_ordered_write_end(struct file *file,
 
 	if (pos + len > inode->i_size)
 		vmtruncate(inode, inode->i_size);
+	block_unlock_hole_extend(inode);
 	return ret ? ret : copied;
 }
 
@@ -1316,6 +1322,7 @@ static int ext3_writeback_write_end(struct file *file,
 
 	if (pos + len > inode->i_size)
 		vmtruncate(inode, inode->i_size);
+	block_unlock_hole_extend(inode);
 	return ret ? ret : copied;
 }
 
@@ -1369,6 +1376,7 @@ static int ext3_journalled_write_end(struct file *file,
 
 	if (pos + len > inode->i_size)
 		vmtruncate(inode, inode->i_size);
+	block_unlock_hole_extend(inode);
 	return ret ? ret : copied;
 }
 
@@ -1424,18 +1432,6 @@ static sector_t ext3_bmap(struct address_space *mapping, sector_t block)
 	return generic_block_bmap(mapping,block,ext3_get_block);
 }
 
-static int bget_one(handle_t *handle, struct buffer_head *bh)
-{
-	get_bh(bh);
-	return 0;
-}
-
-static int bput_one(handle_t *handle, struct buffer_head *bh)
-{
-	put_bh(bh);
-	return 0;
-}
-
 static int buffer_unmapped(handle_t *handle, struct buffer_head *bh)
 {
 	return !buffer_mapped(bh);
@@ -1487,125 +1483,25 @@ static int buffer_unmapped(handle_t *handle, struct buffer_head *bh)
  *   We'll probably need that anyway for journalling writepage() output.
  *
  * We don't honour synchronous mounts for writepage().  That would be
- * disastrous.  Any write() or metadata operation will sync the fs for
+ * disastrous. Any write() or metadata operation will sync the fs for
  * us.
  *
- * AKPM2: if all the page's buffers are mapped to disk and !data=journal,
- * we don't need to open a transaction here.
+ * Note, even though we try, we *may* end up allocating blocks here because
+ * page_mkwrite() has not allocated blocks yet but dirty buffers were created
+ * under the whole page, not just the part inside old i_size. We could just
+ * ignore writing such buffers but it would be harder to avoid it then just
+ * do it...
  */
-static int ext3_ordered_writepage(struct page *page,
-				struct writeback_control *wbc)
-{
-	struct inode *inode = page->mapping->host;
-	struct buffer_head *page_bufs;
-	handle_t *handle = NULL;
-	int ret = 0;
-	int err;
-
-	J_ASSERT(PageLocked(page));
-
-	/*
-	 * We give up here if we're reentered, because it might be for a
-	 * different filesystem.
-	 */
-	if (ext3_journal_current_handle())
-		goto out_fail;
-
-	if (!page_has_buffers(page)) {
-		create_empty_buffers(page, inode->i_sb->s_blocksize,
-				(1 << BH_Dirty)|(1 << BH_Uptodate));
-		page_bufs = page_buffers(page);
-	} else {
-		page_bufs = page_buffers(page);
-		if (!walk_page_buffers(NULL, page_bufs, 0, PAGE_CACHE_SIZE,
-				       NULL, buffer_unmapped)) {
-			/* Provide NULL get_block() to catch bugs if buffers
-			 * weren't really mapped */
-			return block_write_full_page(page, NULL, wbc);
-		}
-	}
-	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
-
-	if (IS_ERR(handle)) {
-		ret = PTR_ERR(handle);
-		goto out_fail;
-	}
-
-	walk_page_buffers(handle, page_bufs, 0,
-			PAGE_CACHE_SIZE, NULL, bget_one);
-
-	ret = block_write_full_page(page, ext3_get_block, wbc);
-
-	/*
-	 * The page can become unlocked at any point now, and
-	 * truncate can then come in and change things.  So we
-	 * can't touch *page from now on.  But *page_bufs is
-	 * safe due to elevated refcount.
-	 */
-
-	/*
-	 * And attach them to the current transaction.  But only if
-	 * block_write_full_page() succeeded.  Otherwise they are unmapped,
-	 * and generally junk.
-	 */
-	if (ret == 0) {
-		err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
-					NULL, journal_dirty_data_fn);
-		if (!ret)
-			ret = err;
-	}
-	walk_page_buffers(handle, page_bufs, 0,
-			PAGE_CACHE_SIZE, NULL, bput_one);
-	err = ext3_journal_stop(handle);
-	if (!ret)
-		ret = err;
-	return ret;
-
-out_fail:
-	redirty_page_for_writepage(wbc, page);
-	unlock_page(page);
-	return ret;
-}
-
-static int ext3_writeback_writepage(struct page *page,
+static int ext3_common_writepage(struct page *page,
 				struct writeback_control *wbc)
 {
 	struct inode *inode = page->mapping->host;
-	handle_t *handle = NULL;
 	int ret = 0;
-	int err;
-
-	if (ext3_journal_current_handle())
-		goto out_fail;
-
-	if (page_has_buffers(page)) {
-		if (!walk_page_buffers(NULL, page_buffers(page), 0,
-				      PAGE_CACHE_SIZE, NULL, buffer_unmapped)) {
-			/* Provide NULL get_block() to catch bugs if buffers
-			 * weren't really mapped */
-			return block_write_full_page(page, NULL, wbc);
-		}
-	}
-
-	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
-	if (IS_ERR(handle)) {
-		ret = PTR_ERR(handle);
-		goto out_fail;
-	}
 
 	if (test_opt(inode->i_sb, NOBH) && ext3_should_writeback_data(inode))
 		ret = nobh_writepage(page, ext3_get_block, wbc);
 	else
 		ret = block_write_full_page(page, ext3_get_block, wbc);
-
-	err = ext3_journal_stop(handle);
-	if (!ret)
-		ret = err;
-	return ret;
-
-out_fail:
-	redirty_page_for_writepage(wbc, page);
-	unlock_page(page);
 	return ret;
 }
 
@@ -1752,9 +1648,11 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
 	if (orphan) {
 		int err;
 
+		block_lock_hole_extend(inode, offset);
 		/* Credits for sb + inode write */
 		handle = ext3_journal_start(inode, 2);
 		if (IS_ERR(handle)) {
+			block_unlock_hole_extend(inode);
 			/* This is really bad luck. We've written the data
 			 * but cannot extend i_size. Bail out and pretend
 			 * the write failed... */
@@ -1781,11 +1679,84 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
 		err = ext3_journal_stop(handle);
 		if (ret == 0)
 			ret = err;
+		block_unlock_hole_extend(inode);
 	}
 out:
 	return ret;
 }
 
+int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct page *page = vmf->page;
+	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = file->f_path.dentry->d_inode;
+	int ret = VM_FAULT_NOPAGE;
+	loff_t size;
+	int len;
+	void *fsdata;
+
+	block_wait_on_hole_extend(inode, page_offset(page));
+	/*
+	 * Get i_alloc_sem to stop truncates messing with the inode. We cannot
+	 * get i_mutex because we are already holding mmap_sem.
+	 */
+	down_read(&inode->i_alloc_sem);
+	size = i_size_read(inode);
+	if ((page->mapping != inode->i_mapping) ||
+	    (page_offset(page) > size)) {
+		/* page got truncated out from underneath us */
+		goto out_unlock;
+	}
+
+	/* page is wholly or partially inside EOF */
+	if (((page->index + 1) << PAGE_CACHE_SHIFT) > size)
+		len = size & ~PAGE_CACHE_MASK;
+	else
+		len = PAGE_CACHE_SIZE;
+
+	/*
+	 * Check for the common case that everything is already mapped. We
+	 * have to get the page lock so that buffers cannot be released
+	 * under us.
+	 */
+	lock_page(page);
+	if (page_has_buffers(page)) {
+		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
+				       buffer_unmapped)) {
+			unlock_page(page);
+			ret = 0;
+			goto out_unlock;
+		}
+	}
+	unlock_page(page);
+
+	/*
+	 * OK, we may need to fill the hole... Do write_begin write_end to do
+	 * block allocation/reservation. We are not holding inode.i_mutex
+	 * here. That allows parallel write_begin, write_end call. lock_page
+	 * prevent this from happening on the same page though.
+	 */
+	ret = mapping->a_ops->write_begin(file, mapping, page_offset(page),
+			len, AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
+	if (ret < 0)
+		goto out_unlock;
+	ret = mapping->a_ops->write_end(file, mapping, page_offset(page),
+			len, len, page, fsdata);
+	if (ret < 0)
+		goto out_unlock;
+	ret = 0;
+out_unlock:
+	if (unlikely(ret)) {
+		if (ret == -ENOMEM)
+			ret = VM_FAULT_OOM;
+		else /* -ENOSPC, -EIO, etc */
+			ret = VM_FAULT_SIGBUS;
+	}
+	up_read(&inode->i_alloc_sem);
+	return ret;
+}
+
 /*
  * Pages can be marked dirty completely asynchronously from ext3's journalling
  * activity.  By filemap_sync_pte(), try_to_unmap_one(), etc.  We cannot do
@@ -1808,10 +1779,11 @@ static int ext3_journalled_set_page_dirty(struct page *page)
 static const struct address_space_operations ext3_ordered_aops = {
 	.readpage		= ext3_readpage,
 	.readpages		= ext3_readpages,
-	.writepage		= ext3_ordered_writepage,
+	.writepage		= ext3_common_writepage,
 	.sync_page		= block_sync_page,
 	.write_begin		= ext3_write_begin,
 	.write_end		= ext3_ordered_write_end,
+	.extend_i_size		= block_extend_i_size,
 	.bmap			= ext3_bmap,
 	.invalidatepage		= ext3_invalidatepage,
 	.releasepage		= ext3_releasepage,
@@ -1823,10 +1795,11 @@ static const struct address_space_operations ext3_ordered_aops = {
 static const struct address_space_operations ext3_writeback_aops = {
 	.readpage		= ext3_readpage,
 	.readpages		= ext3_readpages,
-	.writepage		= ext3_writeback_writepage,
+	.writepage		= ext3_common_writepage,
 	.sync_page		= block_sync_page,
 	.write_begin		= ext3_write_begin,
 	.write_end		= ext3_writeback_write_end,
+	.extend_i_size		= block_extend_i_size,
 	.bmap			= ext3_bmap,
 	.invalidatepage		= ext3_invalidatepage,
 	.releasepage		= ext3_releasepage,
@@ -1842,6 +1815,7 @@ static const struct address_space_operations ext3_journalled_aops = {
 	.sync_page		= block_sync_page,
 	.write_begin		= ext3_write_begin,
 	.write_end		= ext3_journalled_write_end,
+	.extend_i_size		= block_extend_i_size,
 	.set_page_dirty		= ext3_journalled_set_page_dirty,
 	.bmap			= ext3_bmap,
 	.invalidatepage		= ext3_invalidatepage,
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 7499b36..5051874 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -892,6 +892,7 @@ extern void ext3_get_inode_flags(struct ext3_inode_info *);
 extern void ext3_set_aops(struct inode *inode);
 extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		       u64 start, u64 len);
+extern int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
 
 /* ioctl.c */
 extern long ext3_ioctl(struct file *, unsigned int, unsigned long);
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 06/11] vfs: Implement generic per-cpu counters for delayed allocation
  2009-06-15 17:59 [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3) Jan Kara
                   ` (4 preceding siblings ...)
  2009-06-15 17:59 ` [PATCH 05/11] ext3: Allocate space for mmaped file on page fault Jan Kara
@ 2009-06-15 17:59 ` Jan Kara
  2009-06-15 17:59 ` [PATCH 07/11] vfs: Unmap underlying metadata of new data buffers only when buffer is mapped Jan Kara
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2009-06-15 17:59 UTC (permalink / raw)
  To: LKML; +Cc: linux-mm, linux-fsdevel, npiggin, Jan Kara

Implement free blocks and reserved blocks counters for delayed
allocation. These counters are reliable in the sence that when
they return success, the subsequent conversion from reserved to
allocated blocks always succeeds (see comments in the code for
details). This is useful for ext? based filesystems to implement
delayed allocation in particular for allocation in page_mkwrite.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/Kconfig                       |    4 ++
 fs/Makefile                      |    1 +
 fs/delalloc_counter.c            |  109 ++++++++++++++++++++++++++++++++++++++
 fs/ext3/Kconfig                  |    1 +
 include/linux/delalloc_counter.h |   63 ++++++++++++++++++++++
 5 files changed, 178 insertions(+), 0 deletions(-)
 create mode 100644 fs/delalloc_counter.c
 create mode 100644 include/linux/delalloc_counter.h

diff --git a/fs/Kconfig b/fs/Kconfig
index 525da2e..82882b9 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -19,6 +19,10 @@ config FS_XIP
 source "fs/jbd/Kconfig"
 source "fs/jbd2/Kconfig"
 
+config DELALLOC_COUNTER
+	bool
+	default y if EXT3_FS=y
+
 config FS_MBCACHE
 # Meta block cache for Extended Attributes (ext2/ext3/ext4)
 	tristate
diff --git a/fs/Makefile b/fs/Makefile
index af6d047..b5614fc 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -19,6 +19,7 @@ else
 obj-y +=	no-block.o
 endif
 
+obj-$(CONFIG_DELALLOC_COUNTER)	+= delalloc_counter.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o
 obj-y				+= notify/
 obj-$(CONFIG_EPOLL)		+= eventpoll.o
diff --git a/fs/delalloc_counter.c b/fs/delalloc_counter.c
new file mode 100644
index 0000000..0f575d5
--- /dev/null
+++ b/fs/delalloc_counter.c
@@ -0,0 +1,109 @@
+/*
+ *  Per-cpu counters for delayed allocation
+ */
+#include <linux/percpu_counter.h>
+#include <linux/delalloc_counter.h>
+#include <linux/module.h>
+#include <linux/log2.h>
+
+static long dac_error(struct delalloc_counter *c)
+{
+#ifdef CONFIG_SMP
+	return c->batch * nr_cpu_ids;
+#else
+	return 0;
+#endif
+}
+
+/*
+ * Reserve blocks for delayed allocation
+ *
+ * This code is subtle because we want to avoid synchronization of processes
+ * doing allocation in the common case when there's plenty of space in the
+ * filesystem.
+ *
+ * The code maintains the following property: Among all the calls to
+ * dac_reserve() that return 0 there exists a simple sequential ordering of
+ * these calls such that the check (free - reserved >= limit) in each call
+ * succeeds. This guarantees that we never reserve blocks we don't have.
+ *
+ * The proof of the above invariant: The function can return 0 either when the
+ * first if succeeds or when both ifs fail. To the first type of callers we
+ * assign the time of read of c->reserved in the first if, to the second type
+ * of callers we assign the time of read of c->reserved in the second if. We
+ * order callers by their assigned time and claim that this is the ordering
+ * required by the invariant. Suppose that a check (free - reserved >= limit)
+ * fails for caller C in the proposed ordering. We distinguish two cases:
+ * 1) function called by C returned zero because the first if succeeded - in
+ *  this case reads of counters in the first if must have seen effects of
+ *  __percpu_counter_add of all the callers before C (even their condition
+ *  evaluation happened before our). The errors accumulated in cpu-local
+ *  variables are clearly < dac_error(c) and thus the condition should fail.
+ *  Contradiction.
+ * 2) function called by C returned zero because the second if failed - again
+ *  the read of the counters must have seen effects of __percpu_counter_add of
+ *  all the callers before C and thus the condition should have succeeded.
+ *  Contradiction.
+ */
+int dac_reserve(struct delalloc_counter *c, s32 amount, s64 limit)
+{
+	s64 free, reserved;
+	int ret = 0;
+
+	__percpu_counter_add(&c->reserved, amount, c->batch);
+	/*
+	 * This barrier makes sure that when effects of the following read of
+	 * c->reserved are observable by another CPU also effects of the
+	 * previous store to c->reserved are seen.
+	 */
+	smp_mb();
+	if (percpu_counter_read(&c->free) - percpu_counter_read(&c->reserved)
+	    - 2 * dac_error(c) >= limit)
+		return ret;
+	/*
+	 * Near the limit - sum the counter to avoid returning ENOSPC too
+	 * early. Note that we can still "unnecessarily" return ENOSPC when
+	 * there are several racing writers. Spinlock in this section would
+	 * solve it but let's ignore it for now.
+	 */
+	free = percpu_counter_sum_positive(&c->free);
+	reserved = percpu_counter_sum_positive(&c->reserved);
+	if (free - reserved < limit) {
+		__percpu_counter_add(&c->reserved, -amount, c->batch);
+		ret = -ENOSPC;
+	}
+	return ret;
+}
+EXPORT_SYMBOL(dac_reserve);
+
+/* Account reserved blocks as allocated */
+void dac_alloc_reserved(struct delalloc_counter *c, s32 amount)
+{
+	__percpu_counter_add(&c->free, -amount, c->batch);
+	/*
+	 * Make sure update of free counter is seen before update of
+	 * reserved counter.
+	 */
+	smp_wmb();
+	__percpu_counter_add(&c->reserved, -amount, c->batch);
+}
+EXPORT_SYMBOL(dac_alloc_reserved);
+
+int dac_init(struct delalloc_counter *c, s64 amount)
+{
+	int err;
+
+	c->batch = 8*(1+ilog2(nr_cpu_ids));
+	err = percpu_counter_init(&c->free, amount);
+	if (!err)
+		err = percpu_counter_init(&c->reserved, 0);
+	return err;
+}
+EXPORT_SYMBOL(dac_init);
+
+void dac_destroy(struct delalloc_counter *c)
+{
+	percpu_counter_destroy(&c->free);
+	percpu_counter_destroy(&c->reserved);
+}
+EXPORT_SYMBOL(dac_destroy);
diff --git a/fs/ext3/Kconfig b/fs/ext3/Kconfig
index fb3c1a2..f4e122f 100644
--- a/fs/ext3/Kconfig
+++ b/fs/ext3/Kconfig
@@ -1,6 +1,7 @@
 config EXT3_FS
 	tristate "Ext3 journalling file system support"
 	select JBD
+	select DELALLOC_COUNTER
 	help
 	  This is the journalling version of the Second extended file system
 	  (often called ext3), the de facto standard Linux file system
diff --git a/include/linux/delalloc_counter.h b/include/linux/delalloc_counter.h
new file mode 100644
index 0000000..9d00b6c
--- /dev/null
+++ b/include/linux/delalloc_counter.h
@@ -0,0 +1,63 @@
+#ifndef _LINUX_DELALLOC_COUNTER_H
+#define _LINUX_DELALLOC_COUNTER_H
+
+#include <linux/percpu_counter.h>
+
+struct delalloc_counter {
+	struct percpu_counter free;
+	struct percpu_counter reserved;
+	int batch;
+};
+
+int dac_reserve(struct delalloc_counter *c, s32 amount, s64 limit);
+void dac_alloc_reserved(struct delalloc_counter *c, s32 amount);
+
+static inline int dac_alloc(struct delalloc_counter *c, s32 amount, s64 limit)
+{
+	int ret = dac_reserve(c, amount, limit);
+	if (!ret)
+		dac_alloc_reserved(c, amount);
+	return ret;
+}
+
+static inline void dac_free(struct delalloc_counter *c, s32 amount)
+{
+        __percpu_counter_add(&c->free, amount, c->batch);
+}
+
+static inline void dac_cancel_reserved(struct delalloc_counter *c, s32 amount)
+{
+        __percpu_counter_add(&c->reserved, -amount, c->batch);
+}
+
+int dac_init(struct delalloc_counter *c, s64 amount);
+void dac_destroy(struct delalloc_counter *c);
+
+static inline s64 dac_get_avail(struct delalloc_counter *c)
+{
+	s64 ret = percpu_counter_read(&c->free) -
+	       percpu_counter_read(&c->reserved);
+	if (ret < 0)
+		return 0;
+	return ret;
+}
+
+static inline s64 dac_get_avail_sum(struct delalloc_counter *c)
+{
+	s64 ret = percpu_counter_sum(&c->free) -
+	       percpu_counter_sum(&c->reserved);
+	if (ret < 0)
+		return 0;
+	return ret;
+}
+
+static inline s64 dac_get_reserved(struct delalloc_counter *c)
+{
+	return percpu_counter_read_positive(&c->reserved);
+}
+
+static inline s64 dac_get_reserved_sum(struct delalloc_counter *c)
+{
+	return percpu_counter_sum_positive(&c->reserved);
+}
+#endif
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 07/11] vfs: Unmap underlying metadata of new data buffers only when buffer is mapped
  2009-06-15 17:59 [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3) Jan Kara
                   ` (5 preceding siblings ...)
  2009-06-15 17:59 ` [PATCH 06/11] vfs: Implement generic per-cpu counters for delayed allocation Jan Kara
@ 2009-06-15 17:59 ` Jan Kara
  2009-06-17 10:35   ` Nick Piggin
  2009-06-18 11:51   ` OGAWA Hirofumi
  2009-06-15 17:59 ` [PATCH 08/11] fs: Don't clear dirty bits in block_write_full_page() Jan Kara
                   ` (5 subsequent siblings)
  12 siblings, 2 replies; 39+ messages in thread
From: Jan Kara @ 2009-06-15 17:59 UTC (permalink / raw)
  To: LKML; +Cc: linux-mm, linux-fsdevel, npiggin, Jan Kara

When we do delayed allocation of some buffer, we want to signal to VFS that
the buffer is new (set buffer_new) so that it properly zeros out everything.
But we don't have the buffer mapped yet so we cannot really unmap underlying
metadata in this state. Make VFS avoid doing unmapping of metadata when the
buffer is not yet mapped.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/buffer.c |   12 +++++++-----
 1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 80e2630..7eb1710 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1683,8 +1683,9 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 			if (buffer_new(bh)) {
 				/* blockdev mappings never come here */
 				clear_buffer_new(bh);
-				unmap_underlying_metadata(bh->b_bdev,
-							bh->b_blocknr);
+				if (buffer_mapped(bh))
+					unmap_underlying_metadata(bh->b_bdev,
+						bh->b_blocknr);
 			}
 		}
 		bh = bh->b_this_page;
@@ -1869,8 +1870,9 @@ static int __block_prepare_write(struct inode *inode, struct page *page,
 			if (err)
 				break;
 			if (buffer_new(bh)) {
-				unmap_underlying_metadata(bh->b_bdev,
-							bh->b_blocknr);
+				if (buffer_mapped(bh))
+					unmap_underlying_metadata(bh->b_bdev,
+						bh->b_blocknr);
 				if (PageUptodate(page)) {
 					clear_buffer_new(bh);
 					set_buffer_uptodate(bh);
@@ -2683,7 +2685,7 @@ int nobh_write_begin(struct file *file, struct address_space *mapping,
 			goto failed;
 		if (!buffer_mapped(bh))
 			is_mapped_to_disk = 0;
-		if (buffer_new(bh))
+		if (buffer_new(bh) && buffer_mapped(bh))
 			unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
 		if (PageUptodate(page)) {
 			set_buffer_uptodate(bh);
-- 
1.6.0.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 08/11] fs: Don't clear dirty bits in block_write_full_page()
  2009-06-15 17:59 [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3) Jan Kara
                   ` (6 preceding siblings ...)
  2009-06-15 17:59 ` [PATCH 07/11] vfs: Unmap underlying metadata of new data buffers only when buffer is mapped Jan Kara
@ 2009-06-15 17:59 ` Jan Kara
  2009-06-15 17:59 ` [PATCH 09/11] vfs: Export wakeup_pdflush Jan Kara
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2009-06-15 17:59 UTC (permalink / raw)
  To: LKML; +Cc: linux-mm, linux-fsdevel, npiggin, Jan Kara

If getblock() fails in block_write_full_page(), we don't want to clear
dirty bits on buffers. Actually, we even want to redirty the page. This
way we just won't silently discard users data (written e.g. through mmap)
in case of ENOSPC, EDQUOT, EIO or other write error (which may be just
transient e.g. because we have to commit a transaction to free up some space).
The downside of this approach is that if the error is persistent we have this
page pinned in memory forever and if there are lots of such pages, we can bring
the machine OOM.

We also don't want to clear dirty bits from buffers above i_size because that
is a generally a bussiness of invalidatepage where filesystem might want to do
some additional work. If we clear dirty bits already in block_write_full_page,
memory reclaim can reap the page before invalidatepage is called on the page
and thus confusing the filesystem.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/buffer.c |   40 +++++++++++++++++-----------------------
 1 files changed, 17 insertions(+), 23 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 7eb1710..21a8cb9 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1662,19 +1662,14 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 	 * handle any aliases from the underlying blockdev's mapping.
 	 */
 	do {
-		if (block > last_block) {
-			/*
-			 * mapped buffers outside i_size will occur, because
-			 * this page can be outside i_size when there is a
-			 * truncate in progress.
-			 */
-			/*
-			 * The buffer was zeroed by block_write_full_page()
-			 */
-			clear_buffer_dirty(bh);
-			set_buffer_uptodate(bh);
-		} else if ((!buffer_mapped(bh) || buffer_delay(bh)) &&
-			   buffer_dirty(bh)) {
+		/*
+		 * Mapped buffers outside i_size will occur, because
+		 * this page can be outside i_size when there is a
+		 * truncate in progress.
+		 */
+		if (block <= last_block &&
+		    (!buffer_mapped(bh) || buffer_delay(bh)) &&
+		    buffer_dirty(bh)) {
 			WARN_ON(bh->b_size != blocksize);
 			err = get_block(inode, block, bh, 1);
 			if (err)
@@ -1692,9 +1687,10 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 		block++;
 	} while (bh != head);
 
+	block = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
 	do {
-		if (!buffer_mapped(bh))
-			continue;
+		if (!buffer_mapped(bh) || block > last_block)
+			goto next;
 		/*
 		 * If it's a fully non-blocking write attempt and we cannot
 		 * lock the buffer then redirty the page.  Note that this can
@@ -1706,13 +1702,15 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 			lock_buffer(bh);
 		} else if (!trylock_buffer(bh)) {
 			redirty_page_for_writepage(wbc, page);
-			continue;
+			goto next;
 		}
 		if (test_clear_buffer_dirty(bh)) {
 			mark_buffer_async_write_endio(bh, handler);
 		} else {
 			unlock_buffer(bh);
 		}
+next:
+		block++;
 	} while ((bh = bh->b_this_page) != head);
 
 	/*
@@ -1753,9 +1751,11 @@ recover:
 	/*
 	 * ENOSPC, or some other error.  We may already have added some
 	 * blocks to the file, so we need to write these out to avoid
-	 * exposing stale data.
+	 * exposing stale data. We redirty the page so that we don't
+	 * loose data we are unable to write.
 	 * The page is currently locked and not marked for writeback
 	 */
+	redirty_page_for_writepage(wbc, page);
 	bh = head;
 	/* Recovery: lock and submit the mapped buffers */
 	do {
@@ -1763,12 +1763,6 @@ recover:
 		    !buffer_delay(bh)) {
 			lock_buffer(bh);
 			mark_buffer_async_write_endio(bh, handler);
-		} else {
-			/*
-			 * The buffer may have been set dirty during
-			 * attachment to a dirty page.
-			 */
-			clear_buffer_dirty(bh);
 		}
 	} while ((bh = bh->b_this_page) != head);
 	SetPageError(page);
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 09/11] vfs: Export wakeup_pdflush
  2009-06-15 17:59 [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3) Jan Kara
                   ` (7 preceding siblings ...)
  2009-06-15 17:59 ` [PATCH 08/11] fs: Don't clear dirty bits in block_write_full_page() Jan Kara
@ 2009-06-15 17:59 ` Jan Kara
  2009-06-15 17:59 ` [PATCH 10/11] ext3: Implement delayed allocation on page_mkwrite time Jan Kara
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2009-06-15 17:59 UTC (permalink / raw)
  To: LKML; +Cc: linux-mm, linux-fsdevel, npiggin, Jan Kara

When we are running out of free space on a filesystem, we want to flush dirty
pages in the filesystem so that blocks reserved via delayed allocation get
written and unnecessary block reservation is released. Export wakeup_pdflush()
so that filesystem can start such writeback.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/page-writeback.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index bb553c3..43af1cf 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -730,6 +730,7 @@ int wakeup_pdflush(long nr_pages)
 				global_page_state(NR_UNSTABLE_NFS);
 	return pdflush_operation(background_writeout, nr_pages);
 }
+EXPORT_SYMBOL(wakeup_pdflush);
 
 static void wb_timer_fn(unsigned long unused);
 static void laptop_timer_fn(unsigned long unused);
-- 
1.6.0.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 10/11] ext3: Implement delayed allocation on page_mkwrite time
  2009-06-15 17:59 [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3) Jan Kara
                   ` (8 preceding siblings ...)
  2009-06-15 17:59 ` [PATCH 09/11] vfs: Export wakeup_pdflush Jan Kara
@ 2009-06-15 17:59 ` Jan Kara
  2009-06-15 18:02 ` [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3) Jan Kara
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2009-06-15 17:59 UTC (permalink / raw)
  To: LKML; +Cc: linux-mm, linux-fsdevel, npiggin, Jan Kara

We don't want to really allocate blocks on page_mkwrite() time because for
random writes via mmap it results is much more fragmented files. So just
reserve enough free blocks in page_mkwrite() and do the real allocation from
writepage().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext3/balloc.c           |   96 ++++++++++++++--------
 fs/ext3/ialloc.c           |    2 +-
 fs/ext3/inode.c            |  196 +++++++++++++++++++++++++++-----------------
 fs/ext3/resize.c           |    2 +-
 fs/ext3/super.c            |   42 ++++++----
 include/linux/ext3_fs.h    |    9 ++-
 include/linux/ext3_fs_i.h  |    1 +
 include/linux/ext3_fs_sb.h |    3 +-
 8 files changed, 219 insertions(+), 132 deletions(-)

diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
index 27967f9..0e294a1 100644
--- a/fs/ext3/balloc.c
+++ b/fs/ext3/balloc.c
@@ -19,6 +19,8 @@
 #include <linux/ext3_jbd.h>
 #include <linux/quotaops.h>
 #include <linux/buffer_head.h>
+#include <linux/delalloc_counter.h>
+#include <linux/writeback.h>
 
 /*
  * balloc.c contains the blocks allocation and deallocation routines
@@ -632,7 +634,7 @@ do_more:
 	spin_lock(sb_bgl_lock(sbi, block_group));
 	le16_add_cpu(&desc->bg_free_blocks_count, group_freed);
 	spin_unlock(sb_bgl_lock(sbi, block_group));
-	percpu_counter_add(&sbi->s_freeblocks_counter, count);
+	dac_free(&sbi->s_alloc_counter, count);
 
 	/* We dirtied the bitmap block */
 	BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
@@ -1410,23 +1412,19 @@ out:
 }
 
 /**
- * ext3_has_free_blocks()
- * @sbi:		in-core super block structure.
+ * ext3_free_blocks_limit()
+ * @sb:			super block
  *
  * Check if filesystem has at least 1 free block available for allocation.
  */
-static int ext3_has_free_blocks(struct ext3_sb_info *sbi)
+ext3_fsblk_t ext3_free_blocks_limit(struct super_block *sb)
 {
-	ext3_fsblk_t free_blocks, root_blocks;
+	struct ext3_sb_info *sbi = EXT3_SB(sb);
 
-	free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
-	root_blocks = le32_to_cpu(sbi->s_es->s_r_blocks_count);
-	if (free_blocks < root_blocks + 1 && !capable(CAP_SYS_RESOURCE) &&
-		sbi->s_resuid != current_fsuid() &&
-		(sbi->s_resgid == 0 || !in_group_p (sbi->s_resgid))) {
-		return 0;
-	}
-	return 1;
+	if (!capable(CAP_SYS_RESOURCE) && sbi->s_resuid != current_fsuid() &&
+	    (sbi->s_resgid == 0 || !in_group_p(sbi->s_resgid)))
+		return le32_to_cpu(sbi->s_es->s_r_blocks_count) + 1;
+	return 0;
 }
 
 /**
@@ -1443,12 +1441,23 @@ static int ext3_has_free_blocks(struct ext3_sb_info *sbi)
  */
 int ext3_should_retry_alloc(struct super_block *sb, int *retries)
 {
-	if (!ext3_has_free_blocks(EXT3_SB(sb)) || (*retries)++ > 3)
+	struct ext3_sb_info *sbi = EXT3_SB(sb);
+	ext3_fsblk_t free_blocks, limit, reserved_blocks;
+
+	free_blocks = dac_get_avail(&sbi->s_alloc_counter);
+	limit = ext3_free_blocks_limit(sb);
+	reserved_blocks = dac_get_reserved(&sbi->s_alloc_counter);
+	if (!free_blocks + reserved_blocks < limit || (*retries)++ > 3)
 		return 0;
 
 	jbd_debug(1, "%s: retrying operation after ENOSPC\n", sb->s_id);
-
-	return journal_force_commit_nested(EXT3_SB(sb)->s_journal);
+	/*
+	 * There's a chance commit will free some blocks and writeback can
+	 * write delayed blocks so that excessive reservation gets released.
+	 */
+	if (reserved_blocks)
+		wakeup_pdflush(0);
+	return journal_force_commit_nested(sbi->s_journal);
 }
 
 /**
@@ -1466,7 +1475,8 @@ int ext3_should_retry_alloc(struct super_block *sb, int *retries)
  *
  */
 ext3_fsblk_t ext3_new_blocks(handle_t *handle, struct inode *inode,
-			ext3_fsblk_t goal, unsigned long *count, int *errp)
+			ext3_fsblk_t goal, unsigned long *count,
+			unsigned int *reserved, int *errp)
 {
 	struct buffer_head *bitmap_bh = NULL;
 	struct buffer_head *gdp_bh;
@@ -1477,7 +1487,7 @@ ext3_fsblk_t ext3_new_blocks(handle_t *handle, struct inode *inode,
 	ext3_fsblk_t ret_block;		/* filesyetem-wide allocated block */
 	int bgi;			/* blockgroup iteration index */
 	int fatal = 0, err;
-	int performed_allocation = 0;
+	int got_quota = 0, got_space = 0;
 	ext3_grpblk_t free_blocks;	/* number of free blocks in a group */
 	struct super_block *sb;
 	struct ext3_group_desc *gdp;
@@ -1498,16 +1508,27 @@ ext3_fsblk_t ext3_new_blocks(handle_t *handle, struct inode *inode,
 		printk("ext3_new_block: nonexistent device");
 		return 0;
 	}
+	sbi = EXT3_SB(sb);
 
-	/*
-	 * Check quota for allocation of this block.
-	 */
-	if (vfs_dq_alloc_block(inode, num)) {
-		*errp = -EDQUOT;
-		return 0;
+	if (!*reserved) {
+		/*
+		 * Check quota for allocation of this block.
+		 */
+		if (vfs_dq_alloc_block(inode, num)) {
+			*errp = -EDQUOT;
+			goto out;
+		}
+		got_quota = 1;
+		if (dac_alloc(&sbi->s_alloc_counter, num,
+			      ext3_free_blocks_limit(sb)) < 0) {
+			*errp = -ENOSPC;
+			goto out;
+		}
+		got_space = 1;
+	} else {
+		WARN_ON(*reserved < num);
 	}
 
-	sbi = EXT3_SB(sb);
 	es = EXT3_SB(sb)->s_es;
 	ext3_debug("goal=%lu.\n", goal);
 	/*
@@ -1522,11 +1543,6 @@ ext3_fsblk_t ext3_new_blocks(handle_t *handle, struct inode *inode,
 	if (block_i && ((windowsz = block_i->rsv_window_node.rsv_goal_size) > 0))
 		my_rsv = &block_i->rsv_window_node;
 
-	if (!ext3_has_free_blocks(sbi)) {
-		*errp = -ENOSPC;
-		goto out;
-	}
-
 	/*
 	 * First, test whether the goal block is free.
 	 */
@@ -1650,8 +1666,6 @@ allocated:
 		goto retry_alloc;
 	}
 
-	performed_allocation = 1;
-
 #ifdef CONFIG_JBD_DEBUG
 	{
 		struct buffer_head *debug_bh;
@@ -1701,7 +1715,6 @@ allocated:
 	spin_lock(sb_bgl_lock(sbi, group_no));
 	le16_add_cpu(&gdp->bg_free_blocks_count, -num);
 	spin_unlock(sb_bgl_lock(sbi, group_no));
-	percpu_counter_sub(&sbi->s_freeblocks_counter, num);
 
 	BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for group descriptor");
 	err = ext3_journal_dirty_metadata(handle, gdp_bh);
@@ -1713,7 +1726,15 @@ allocated:
 
 	*errp = 0;
 	brelse(bitmap_bh);
-	vfs_dq_free_block(inode, *count-num);
+	if (*reserved) {
+		dac_alloc_reserved(&sbi->s_alloc_counter, num);
+		vfs_dq_claim_block(inode, num);
+		atomic_sub(num, &EXT3_I(inode)->i_reserved_blocks);
+		*reserved -= num;
+	} else {
+		dac_free(&sbi->s_alloc_counter, *count - num);
+		vfs_dq_free_block(inode, *count - num);
+	}
 	*count = num;
 	return ret_block;
 
@@ -1727,8 +1748,10 @@ out:
 	/*
 	 * Undo the block allocation
 	 */
-	if (!performed_allocation)
+	if (got_quota)
 		vfs_dq_free_block(inode, *count);
+	if (got_space)
+		dac_free(&sbi->s_alloc_counter, *count);
 	brelse(bitmap_bh);
 	return 0;
 }
@@ -1737,8 +1760,9 @@ ext3_fsblk_t ext3_new_block(handle_t *handle, struct inode *inode,
 			ext3_fsblk_t goal, int *errp)
 {
 	unsigned long count = 1;
+	unsigned int reserved = 0;
 
-	return ext3_new_blocks(handle, inode, goal, &count, errp);
+	return ext3_new_blocks(handle, inode, goal, &count, &reserved, errp);
 }
 
 /**
diff --git a/fs/ext3/ialloc.c b/fs/ext3/ialloc.c
index b399912..347e24c 100644
--- a/fs/ext3/ialloc.c
+++ b/fs/ext3/ialloc.c
@@ -269,7 +269,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent)
 
 	freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter);
 	avefreei = freei / ngroups;
-	freeb = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
+	freeb = dac_get_avail(&sbi->s_alloc_counter);
 	avefreeb = freeb / ngroups;
 	ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);
 
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index ec112b4..cfec38b 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -38,6 +38,7 @@
 #include <linux/bio.h>
 #include <linux/fiemap.h>
 #include <linux/namei.h>
+#include <linux/mount.h>
 #include "xattr.h"
 #include "acl.h"
 
@@ -516,7 +517,8 @@ static int ext3_blks_to_allocate(Indirect *branch, int k, unsigned long blks,
  */
 static int ext3_alloc_blocks(handle_t *handle, struct inode *inode,
 			ext3_fsblk_t goal, int indirect_blks, int blks,
-			ext3_fsblk_t new_blocks[4], int *err)
+			unsigned int *reserved, ext3_fsblk_t new_blocks[4],
+			int *err)
 {
 	int target, i;
 	unsigned long count = 0;
@@ -537,7 +539,8 @@ static int ext3_alloc_blocks(handle_t *handle, struct inode *inode,
 	while (1) {
 		count = target;
 		/* allocating blocks for indirect blocks and direct blocks */
-		current_block = ext3_new_blocks(handle,inode,goal,&count,err);
+		current_block = ext3_new_blocks(handle, inode, goal, &count,
+						reserved, err);
 		if (*err)
 			goto failed_out;
 
@@ -591,8 +594,8 @@ failed_out:
  *	as described above and return 0.
  */
 static int ext3_alloc_branch(handle_t *handle, struct inode *inode,
-			int indirect_blks, int *blks, ext3_fsblk_t goal,
-			int *offsets, Indirect *branch)
+			int indirect_blks, int *blks, unsigned int *reserved,
+			ext3_fsblk_t goal, int *offsets, Indirect *branch)
 {
 	int blocksize = inode->i_sb->s_blocksize;
 	int i, n = 0;
@@ -603,7 +606,7 @@ static int ext3_alloc_branch(handle_t *handle, struct inode *inode,
 	ext3_fsblk_t current_block;
 
 	num = ext3_alloc_blocks(handle, inode, goal, indirect_blks,
-				*blks, new_blocks, &err);
+				*blks, reserved, new_blocks, &err);
 	if (err)
 		return err;
 
@@ -800,6 +803,7 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
 	int depth;
 	struct ext3_inode_info *ei = EXT3_I(inode);
 	int count = 0;
+	unsigned int reserved = 0;
 	ext3_fsblk_t first_block = 0;
 
 
@@ -898,8 +902,24 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
 	/*
 	 * Block out ext3_truncate while we alter the tree
 	 */
-	err = ext3_alloc_branch(handle, inode, indirect_blks, &count, goal,
-				offsets + (partial - chain), partial);
+	if (buffer_delay(bh_result)) {
+		WARN_ON(maxblocks != 1);
+		WARN_ON(!bh_result->b_page || !PageLocked(bh_result->b_page));
+		reserved = EXT3_DA_BLOCK_RESERVE;
+	}
+	err = ext3_alloc_branch(handle, inode, indirect_blks, &count,
+				&reserved, goal, offsets + (partial - chain),
+				partial);
+	/* Release additional reservation we had for this block */
+	if (!err && buffer_delay(bh_result)) {
+		dac_cancel_reserved(&EXT3_SB(inode->i_sb)->s_alloc_counter,
+				    reserved);
+		vfs_dq_release_reservation_block(inode, reserved);
+		atomic_sub(reserved, &ei->i_reserved_blocks);
+		clear_buffer_delay(bh_result);
+	} else if (!err) {
+		set_buffer_new(bh_result);
+	}
 
 	/*
 	 * The ext3_splice_branch call will free and forget any buffers
@@ -914,8 +934,6 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
 	mutex_unlock(&ei->truncate_mutex);
 	if (err)
 		goto cleanup;
-
-	set_buffer_new(bh_result);
 got_it:
 	map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key));
 	if (count > blocks_to_boundary)
@@ -1432,9 +1450,9 @@ static sector_t ext3_bmap(struct address_space *mapping, sector_t block)
 	return generic_block_bmap(mapping,block,ext3_get_block);
 }
 
-static int buffer_unmapped(handle_t *handle, struct buffer_head *bh)
+static int ext3_bh_delay(handle_t *handle, struct buffer_head *bh)
 {
-	return !buffer_mapped(bh);
+	return buffer_delay(bh);
 }
 
 /*
@@ -1496,12 +1514,31 @@ static int ext3_common_writepage(struct page *page,
 				struct writeback_control *wbc)
 {
 	struct inode *inode = page->mapping->host;
-	int ret = 0;
+	int ret = 0, delay = 1, err;
+	handle_t *handle = NULL;
 
+	/* Start a transaction only if the page has delayed buffers */
+	if (page_has_buffers(page))
+		delay = walk_page_buffers(NULL, page_buffers(page), 0,
+				PAGE_CACHE_SIZE, NULL, ext3_bh_delay);
+	if (delay) {
+		handle = ext3_journal_start(inode,
+				ext3_writepage_trans_blocks(inode));
+		if (IS_ERR(handle)) {
+			ret = PTR_ERR(handle);
+			goto out;
+		}
+	}
 	if (test_opt(inode->i_sb, NOBH) && ext3_should_writeback_data(inode))
 		ret = nobh_writepage(page, ext3_get_block, wbc);
 	else
 		ret = block_write_full_page(page, ext3_get_block, wbc);
+	if (delay) {
+		err = ext3_journal_stop(handle);
+		if (!ret)
+			ret = err;
+	}
+out:
 	return ret;
 }
 
@@ -1576,15 +1613,37 @@ ext3_readpages(struct file *file, struct address_space *mapping,
 	return mpage_readpages(mapping, pages, nr_pages, ext3_get_block);
 }
 
+
+static int truncate_delayed_bh(handle_t *handle, struct buffer_head *bh)
+{
+	if (buffer_delay(bh)) {
+		struct inode *inode = bh->b_page->mapping->host;
+		struct ext3_inode_info *ei = EXT3_I(inode);
+
+		atomic_sub(EXT3_DA_BLOCK_RESERVE, &ei->i_reserved_blocks);
+		vfs_dq_release_reservation_block(inode, EXT3_DA_BLOCK_RESERVE);
+		dac_cancel_reserved(&EXT3_SB(inode->i_sb)->s_alloc_counter,
+				    EXT3_DA_BLOCK_RESERVE);
+		clear_buffer_delay(bh);
+	}
+	return 0;
+}
+
 static void ext3_invalidatepage(struct page *page, unsigned long offset)
 {
-	journal_t *journal = EXT3_JOURNAL(page->mapping->host);
+	struct inode *inode = page->mapping->host;
+	journal_t *journal = EXT3_JOURNAL(inode);
+	int bsize = 1 << inode->i_blkbits;
 
 	/*
 	 * If it's a full truncate we just forget about the pending dirtying
 	 */
 	if (offset == 0)
 		ClearPageChecked(page);
+	if (page_has_buffers(page)) {
+		walk_page_buffers(NULL, page_buffers(page), offset + bsize - 1,
+				  PAGE_CACHE_SIZE, NULL, truncate_delayed_bh);
+	}
 
 	journal_invalidatepage(journal, page, offset);
 }
@@ -1685,75 +1744,58 @@ out:
 	return ret;
 }
 
-int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+/*
+ * Reserve block writes instead of allocation. Called only on buffer heads
+ * attached to a page (and thus for 1 block).
+ */
+static int ext3_da_get_block(struct inode *inode, sector_t iblock,
+			     struct buffer_head *bh, int create)
 {
-	struct page *page = vmf->page;
-	struct file *file = vma->vm_file;
-	struct address_space *mapping = file->f_mapping;
-	struct inode *inode = file->f_path.dentry->d_inode;
-	int ret = VM_FAULT_NOPAGE;
-	loff_t size;
-	int len;
-	void *fsdata;
-
-	block_wait_on_hole_extend(inode, page_offset(page));
-	/*
-	 * Get i_alloc_sem to stop truncates messing with the inode. We cannot
-	 * get i_mutex because we are already holding mmap_sem.
-	 */
-	down_read(&inode->i_alloc_sem);
-	size = i_size_read(inode);
-	if ((page->mapping != inode->i_mapping) ||
-	    (page_offset(page) > size)) {
-		/* page got truncated out from underneath us */
-		goto out_unlock;
-	}
-
-	/* page is wholly or partially inside EOF */
-	if (((page->index + 1) << PAGE_CACHE_SHIFT) > size)
-		len = size & ~PAGE_CACHE_MASK;
-	else
-		len = PAGE_CACHE_SIZE;
+	int ret, rsv;
+	struct ext3_sb_info *sbi;
 
-	/*
-	 * Check for the common case that everything is already mapped. We
-	 * have to get the page lock so that buffers cannot be released
-	 * under us.
-	 */
-	lock_page(page);
-	if (page_has_buffers(page)) {
-		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
-				       buffer_unmapped)) {
-			unlock_page(page);
-			ret = 0;
-			goto out_unlock;
-		}
-	}
-	unlock_page(page);
+	/* Buffer has already blocks reserved? */
+	if (buffer_delay(bh))
+		return 0;
 
-	/*
-	 * OK, we may need to fill the hole... Do write_begin write_end to do
-	 * block allocation/reservation. We are not holding inode.i_mutex
-	 * here. That allows parallel write_begin, write_end call. lock_page
-	 * prevent this from happening on the same page though.
-	 */
-	ret = mapping->a_ops->write_begin(file, mapping, page_offset(page),
-			len, AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
+	ret = ext3_get_blocks_handle(NULL, inode, iblock, 1, bh, 0);
 	if (ret < 0)
-		goto out_unlock;
-	ret = mapping->a_ops->write_end(file, mapping, page_offset(page),
-			len, len, page, fsdata);
-	if (ret < 0)
-		goto out_unlock;
-	ret = 0;
-out_unlock:
-	if (unlikely(ret)) {
-		if (ret == -ENOMEM)
-			ret = VM_FAULT_OOM;
-		else /* -ENOSPC, -EIO, etc */
-			ret = VM_FAULT_SIGBUS;
+		goto out;
+	if (ret > 0 || !create) {
+		ret = 0;
+		goto out;
+	}
+	/* Upperbound on number of needed blocks */
+	rsv = EXT3_DA_BLOCK_RESERVE;
+	/* Delayed allocation needed */
+	if (vfs_dq_reserve_block(inode, rsv)) {
+		ret = -EDQUOT;
+		goto out;
+	}
+	sbi = EXT3_SB(inode->i_sb);
+	ret = dac_reserve(&sbi->s_alloc_counter, rsv,
+			  ext3_free_blocks_limit(inode->i_sb));
+	if (ret < 0) {
+		vfs_dq_release_reservation_block(inode, rsv);
+		goto out;
 	}
-	up_read(&inode->i_alloc_sem);
+	set_buffer_delay(bh);
+	set_buffer_new(bh);
+	atomic_add(rsv, &EXT3_I(inode)->i_reserved_blocks);
+out:
+	return ret;
+}
+
+int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	int retry = 0;
+	int ret;
+	struct super_block *sb = vma->vm_file->f_path.mnt->mnt_sb;
+
+	do {
+		ret = block_page_mkwrite(vma, vmf, ext3_da_get_block);
+	} while (ret == VM_FAULT_SIGBUS &&
+		 ext3_should_retry_alloc(sb, &retry));
 	return ret;
 }
 
diff --git a/fs/ext3/resize.c b/fs/ext3/resize.c
index 8a0b263..2f6688a 100644
--- a/fs/ext3/resize.c
+++ b/fs/ext3/resize.c
@@ -928,7 +928,7 @@ int ext3_group_add(struct super_block *sb, struct ext3_new_group_data *input)
 	le32_add_cpu(&es->s_r_blocks_count, input->reserved_blocks);
 
 	/* Update the free space counts */
-	percpu_counter_add(&sbi->s_freeblocks_counter,
+	percpu_counter_add(&sbi->s_alloc_counter.free,
 			   input->free_blocks_count);
 	percpu_counter_add(&sbi->s_freeinodes_counter,
 			   EXT3_INODES_PER_GROUP(sb));
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 26aa64d..036e953 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -417,7 +417,7 @@ static void ext3_put_super (struct super_block * sb)
 	for (i = 0; i < sbi->s_gdb_count; i++)
 		brelse(sbi->s_group_desc[i]);
 	kfree(sbi->s_group_desc);
-	percpu_counter_destroy(&sbi->s_freeblocks_counter);
+	dac_destroy(&sbi->s_alloc_counter);
 	percpu_counter_destroy(&sbi->s_freeinodes_counter);
 	percpu_counter_destroy(&sbi->s_dirs_counter);
 	brelse(sbi->s_sbh);
@@ -494,6 +494,7 @@ static void init_once(void *foo)
 #ifdef CONFIG_EXT3_FS_XATTR
 	init_rwsem(&ei->xattr_sem);
 #endif
+	atomic_set(&ei->i_reserved_blocks, 0);
 	mutex_init(&ei->truncate_mutex);
 	inode_init_once(&ei->vfs_inode);
 }
@@ -517,23 +518,26 @@ static void destroy_inodecache(void)
 
 static void ext3_clear_inode(struct inode *inode)
 {
-	struct ext3_block_alloc_info *rsv = EXT3_I(inode)->i_block_alloc_info;
+	struct ext3_inode_info *ei = EXT3_I(inode);
+	struct ext3_block_alloc_info *rsv = ei->i_block_alloc_info;
 #ifdef CONFIG_EXT3_FS_POSIX_ACL
-	if (EXT3_I(inode)->i_acl &&
-			EXT3_I(inode)->i_acl != EXT3_ACL_NOT_CACHED) {
-		posix_acl_release(EXT3_I(inode)->i_acl);
-		EXT3_I(inode)->i_acl = EXT3_ACL_NOT_CACHED;
+	if (ei->i_acl && ei->i_acl != EXT3_ACL_NOT_CACHED) {
+		posix_acl_release(ei->i_acl);
+		ei->i_acl = EXT3_ACL_NOT_CACHED;
 	}
-	if (EXT3_I(inode)->i_default_acl &&
-			EXT3_I(inode)->i_default_acl != EXT3_ACL_NOT_CACHED) {
-		posix_acl_release(EXT3_I(inode)->i_default_acl);
-		EXT3_I(inode)->i_default_acl = EXT3_ACL_NOT_CACHED;
+	if (ei->i_default_acl && ei->i_default_acl != EXT3_ACL_NOT_CACHED) {
+		posix_acl_release(ei->i_default_acl);
+		ei->i_default_acl = EXT3_ACL_NOT_CACHED;
 	}
 #endif
 	ext3_discard_reservation(inode);
-	EXT3_I(inode)->i_block_alloc_info = NULL;
+	ei->i_block_alloc_info = NULL;
 	if (unlikely(rsv))
 		kfree(rsv);
+	if (atomic_read(&ei->i_reserved_blocks))
+		ext3_warning(inode->i_sb, __func__, "Releasing inode %lu with "
+			"%lu reserved blocks.\n", inode->i_ino,
+			(unsigned long)atomic_read(&ei->i_reserved_blocks));
 }
 
 static inline void ext3_show_quota_options(struct seq_file *seq, struct super_block *sb)
@@ -728,10 +732,19 @@ static ssize_t ext3_quota_read(struct super_block *sb, int type, char *data,
 static ssize_t ext3_quota_write(struct super_block *sb, int type,
 				const char *data, size_t len, loff_t off);
 
+static qsize_t ext3_get_reserved_space(struct inode *inode)
+{
+	return atomic_read(&EXT3_I(inode)->i_reserved_blocks);
+}
+
 static struct dquot_operations ext3_quota_operations = {
 	.initialize	= dquot_initialize,
 	.drop		= dquot_drop,
 	.alloc_space	= dquot_alloc_space,
+	.reserve_space	= dquot_reserve_space,
+	.claim_space	= dquot_claim_space,
+	.release_rsv	= dquot_release_reserved_space,
+	.get_reserved_space = ext3_get_reserved_space,
 	.alloc_inode	= dquot_alloc_inode,
 	.free_space	= dquot_free_space,
 	.free_inode	= dquot_free_inode,
@@ -1851,8 +1864,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
 	get_random_bytes(&sbi->s_next_generation, sizeof(u32));
 	spin_lock_init(&sbi->s_next_gen_lock);
 
-	err = percpu_counter_init(&sbi->s_freeblocks_counter,
-			ext3_count_free_blocks(sb));
+	err = dac_init(&sbi->s_alloc_counter, ext3_count_free_blocks(sb));
 	if (!err) {
 		err = percpu_counter_init(&sbi->s_freeinodes_counter,
 				ext3_count_free_inodes(sb));
@@ -2005,7 +2017,7 @@ cantfind_ext3:
 failed_mount4:
 	journal_destroy(sbi->s_journal);
 failed_mount3:
-	percpu_counter_destroy(&sbi->s_freeblocks_counter);
+	dac_destroy(&sbi->s_alloc_counter);
 	percpu_counter_destroy(&sbi->s_freeinodes_counter);
 	percpu_counter_destroy(&sbi->s_dirs_counter);
 failed_mount2:
@@ -2674,7 +2686,7 @@ static int ext3_statfs (struct dentry * dentry, struct kstatfs * buf)
 	buf->f_type = EXT3_SUPER_MAGIC;
 	buf->f_bsize = sb->s_blocksize;
 	buf->f_blocks = le32_to_cpu(es->s_blocks_count) - sbi->s_overhead_last;
-	buf->f_bfree = percpu_counter_sum_positive(&sbi->s_freeblocks_counter);
+	buf->f_bfree = dac_get_avail_sum(&sbi->s_alloc_counter);
 	es->s_free_blocks_count = cpu_to_le32(buf->f_bfree);
 	buf->f_bavail = buf->f_bfree - le32_to_cpu(es->s_r_blocks_count);
 	if (buf->f_bfree < le32_to_cpu(es->s_r_blocks_count))
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 5051874..7f28907 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -809,6 +809,11 @@ ext3_group_first_block_no(struct super_block *sb, unsigned long group_no)
 #define ERR_BAD_DX_DIR	-75000
 
 /*
+ * Number of blocks we reserve in delayed allocation for one block
+ */
+#define EXT3_DA_BLOCK_RESERVE 4
+
+/*
  * Function prototypes
  */
 
@@ -821,12 +826,14 @@ ext3_group_first_block_no(struct super_block *sb, unsigned long group_no)
 # define NORET_AND     noreturn,
 
 /* balloc.c */
+extern ext3_fsblk_t ext3_free_blocks_limit(struct super_block *sb);
 extern int ext3_bg_has_super(struct super_block *sb, int group);
 extern unsigned long ext3_bg_num_gdb(struct super_block *sb, int group);
 extern ext3_fsblk_t ext3_new_block (handle_t *handle, struct inode *inode,
 			ext3_fsblk_t goal, int *errp);
 extern ext3_fsblk_t ext3_new_blocks (handle_t *handle, struct inode *inode,
-			ext3_fsblk_t goal, unsigned long *count, int *errp);
+			ext3_fsblk_t goal, unsigned long *count,
+			unsigned int *reserved, int *errp);
 extern void ext3_free_blocks (handle_t *handle, struct inode *inode,
 			ext3_fsblk_t block, unsigned long count);
 extern void ext3_free_blocks_sb (handle_t *handle, struct super_block *sb,
diff --git a/include/linux/ext3_fs_i.h b/include/linux/ext3_fs_i.h
index 7894dd0..ef288bd 100644
--- a/include/linux/ext3_fs_i.h
+++ b/include/linux/ext3_fs_i.h
@@ -129,6 +129,7 @@ struct ext3_inode_info {
 
 	/* on-disk additional length */
 	__u16 i_extra_isize;
+	atomic_t i_reserved_blocks;
 
 	/*
 	 * truncate_mutex is for serialising ext3_truncate() against
diff --git a/include/linux/ext3_fs_sb.h b/include/linux/ext3_fs_sb.h
index f07f34d..f92e162 100644
--- a/include/linux/ext3_fs_sb.h
+++ b/include/linux/ext3_fs_sb.h
@@ -21,6 +21,7 @@
 #include <linux/wait.h>
 #include <linux/blockgroup_lock.h>
 #include <linux/percpu_counter.h>
+#include <linux/delalloc_counter.h>
 #endif
 #include <linux/rbtree.h>
 
@@ -58,7 +59,7 @@ struct ext3_sb_info {
 	u32 s_hash_seed[4];
 	int s_def_hash_version;
 	int s_hash_unsigned;	/* 3 if hash should be signed, 0 if not */
-	struct percpu_counter s_freeblocks_counter;
+	struct delalloc_counter s_alloc_counter;
 	struct percpu_counter s_freeinodes_counter;
 	struct percpu_counter s_dirs_counter;
 	struct blockgroup_lock *s_blockgroup_lock;
-- 
1.6.0.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3)
  2009-06-15 17:59 [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3) Jan Kara
                   ` (9 preceding siblings ...)
  2009-06-15 17:59 ` [PATCH 10/11] ext3: Implement delayed allocation on page_mkwrite time Jan Kara
@ 2009-06-15 18:02 ` Jan Kara
  2009-06-15 18:17 ` Aneesh Kumar K.V
  2009-06-16 14:34 ` Christoph Hellwig
  12 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2009-06-15 18:02 UTC (permalink / raw)
  To: LKML; +Cc: linux-mm, linux-fsdevel, npiggin

> 
> patches below are an attempt to solve problems filesystems have with
> page_mkwrite() when blocksize < pagesize (see the changelog of the second patch
> for details).
> 
> Could someone please review them so that they can get merged - especially the
> generic VFS/MM part? It fixes observed problems (WARN_ON triggers) for ext4 and
> makes ext2/ext3 behave more nicely (mmapped write getting page fault instead
> of silently discarding data).
> 
> The series is against Linus's tree from today. The differences against previous
> version are one bugfix in ext3 delalloc implementation... Please test and review.
> Thanks.
  Sorry for the wrong patch numbering [x/11]. There are really just 10
patches in the series. The eleventh one is just a debugging patch I
don't want to merge.

									Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3)
  2009-06-15 17:59 [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3) Jan Kara
                   ` (10 preceding siblings ...)
  2009-06-15 18:02 ` [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3) Jan Kara
@ 2009-06-15 18:17 ` Aneesh Kumar K.V
  2009-06-16 10:28   ` Jan Kara
  2009-06-16 14:34 ` Christoph Hellwig
  12 siblings, 1 reply; 39+ messages in thread
From: Aneesh Kumar K.V @ 2009-06-15 18:17 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, linux-mm, linux-fsdevel, npiggin

On Mon, Jun 15, 2009 at 07:59:47PM +0200, Jan Kara wrote:
> 
> patches below are an attempt to solve problems filesystems have with
> page_mkwrite() when blocksize < pagesize (see the changelog of the second patch
> for details).
> 
> Could someone please review them so that they can get merged - especially the
> generic VFS/MM part? It fixes observed problems (WARN_ON triggers) for ext4 and
> makes ext2/ext3 behave more nicely (mmapped write getting page fault instead
> of silently discarding data).


Will you be able to send it as two series.

a) One that fix the blocksize < page size bug
b) making ext2/3 mmaped write give better allocation pattern.

Doing that will make sure (a) can go in this merge window. There are
other ext4 fixes waiting for (a) to be merged in.


> 
> The series is against Linus's tree from today. The differences against previous
> version are one bugfix in ext3 delalloc implementation... Please test and review.
> Thanks.

-aneesh

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3)
  2009-06-15 18:17 ` Aneesh Kumar K.V
@ 2009-06-16 10:28   ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2009-06-16 10:28 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Jan Kara, LKML, linux-mm, linux-fsdevel, npiggin

  Hi,

On Mon 15-06-09 23:47:53, Aneesh Kumar K.V wrote:
> On Mon, Jun 15, 2009 at 07:59:47PM +0200, Jan Kara wrote:
> > 
> > patches below are an attempt to solve problems filesystems have with
> > page_mkwrite() when blocksize < pagesize (see the changelog of the second patch
> > for details).
> > 
> > Could someone please review them so that they can get merged - especially the
> > generic VFS/MM part? It fixes observed problems (WARN_ON triggers) for ext4 and
> > makes ext2/ext3 behave more nicely (mmapped write getting page fault instead
> > of silently discarding data).
> 
> Will you be able to send it as two series.
> 
> a) One that fix the blocksize < page size bug
> b) making ext2/3 mmaped write give better allocation pattern.
> 
> Doing that will make sure (a) can go in this merge window. There are
> other ext4 fixes waiting for (a) to be merged in.
  Of course, there is no problem in merging just patches 2, 4 which are
needed for ext4, and leave the rest for the next merge window. Actually,
I'd rather leave at least ext3 patch for the next merge window because that
has the highest chance of breaking something...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3)
  2009-06-15 17:59 [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3) Jan Kara
                   ` (11 preceding siblings ...)
  2009-06-15 18:17 ` Aneesh Kumar K.V
@ 2009-06-16 14:34 ` Christoph Hellwig
  2009-06-16 14:42   ` Jan Kara
  12 siblings, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2009-06-16 14:34 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, linux-mm, linux-fsdevel, npiggin

On Mon, Jun 15, 2009 at 07:59:47PM +0200, Jan Kara wrote:
> 
> patches below are an attempt to solve problems filesystems have with
> page_mkwrite() when blocksize < pagesize (see the changelog of the second patch
> for details).

It would be useful if you had a test case reproducing these issues,
so that I can verify how well your patches work in various scenarios.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3)
  2009-06-16 14:34 ` Christoph Hellwig
@ 2009-06-16 14:42   ` Jan Kara
  2009-06-30 17:44     ` Christoph Hellwig
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Kara @ 2009-06-16 14:42 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: LKML, linux-mm, linux-fsdevel, npiggin

[-- Attachment #1: Type: text/plain, Size: 739 bytes --]

On Tue 16-06-09 10:34:24, Christoph Hellwig wrote:
> On Mon, Jun 15, 2009 at 07:59:47PM +0200, Jan Kara wrote:
> > 
> > patches below are an attempt to solve problems filesystems have with
> > page_mkwrite() when blocksize < pagesize (see the changelog of the second patch
> > for details).
> 
> It would be useful if you had a test case reproducing these issues,
> so that I can verify how well your patches work in various scenarios.
  Good point, I should have mentioned in the changelog: fsx-linux is able
to trigger the problem quite quickly.
  I have also written a simple program for initial testing of the fix
(works only for 1K blocksize and 4K pagesize) - it's attached.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: test-mkwrite.c --]
[-- Type: text/x-c++src, Size: 1242 bytes --]

#define _XOPEN_SOURCE 500
#define _GNU_SOURCE
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>

#define MMAP_THREADS 4

void map_file(int fd)
{
	char *addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	int i;

	if (addr == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
	while (1) {
		for (i = 0; i < 128; i++)
			addr[i + 1024] = 'c';
	}
}

char buf2[4096] __attribute__ ((aligned (4096)));

int main(int argc, char **argv)
{
	int i;
	int ret;
	int fd;
	char buf[1024];

	if (argc != 2) {
		printf("Usage: test-mkwrite <file>\n");
		return 1;
	}
	fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(buf, 'a', sizeof(buf));
	pwrite(fd, buf, sizeof(buf), 0);

	for (i = 0; i < MMAP_THREADS; i++) {
		ret = fork();
		if (ret < 0) {
			perror("fork");
			return 1;
		} else if (ret == 0)
			map_file(fd);
	}
	close(fd);

	memset(buf2, 'b', sizeof(buf2));
	fd = open(argv[1], O_RDWR | O_DIRECT);
	if (fd < 0) {
		perror("dopen");
		return 1;
	}
	while (1) {
		if (pwrite(fd, buf2, 4096, 4096) < 0)
			perror("pwrite");
		usleep(10000000);
		ftruncate(fd, 1024);
		usleep(10000000);
	}
	
	return 0;
}

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 01/11] ext3: Get rid of extenddisksize parameter of ext3_get_blocks_handle()
  2009-06-15 17:59 ` [PATCH 01/11] ext3: Get rid of extenddisksize parameter of ext3_get_blocks_handle() Jan Kara
@ 2009-06-17 10:28   ` Nick Piggin
  2009-06-17 11:49     ` Jan Kara
  0 siblings, 1 reply; 39+ messages in thread
From: Nick Piggin @ 2009-06-17 10:28 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, linux-mm, linux-fsdevel

On Mon, Jun 15, 2009 at 07:59:48PM +0200, Jan Kara wrote:
> Get rid of extenddisksize parameter of ext3_get_blocks_handle(). This seems to
> be a relict from some old days and setting disksize in this function does not
> make much sence. Currently it was set only by ext3_getblk().  Since the
> parameter has some effect only if create == 1, it is easy to check that the
> three callers which end up calling ext3_getblk() with create == 1 (ext3_append,
> ext3_quota_write, ext3_mkdir) do the right thing and set disksize themselves.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

I guess something like this should just go in this merge window if
ext3 developers are happy with it? There is no real reason to keep
it with the main patchset is there?



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 07/11] vfs: Unmap underlying metadata of new data buffers only when buffer is mapped
  2009-06-15 17:59 ` [PATCH 07/11] vfs: Unmap underlying metadata of new data buffers only when buffer is mapped Jan Kara
@ 2009-06-17 10:35   ` Nick Piggin
  2009-06-17 12:05     ` Jan Kara
  2009-06-18 11:51   ` OGAWA Hirofumi
  1 sibling, 1 reply; 39+ messages in thread
From: Nick Piggin @ 2009-06-17 10:35 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, linux-mm, linux-fsdevel

On Mon, Jun 15, 2009 at 07:59:54PM +0200, Jan Kara wrote:
> When we do delayed allocation of some buffer, we want to signal to VFS that
> the buffer is new (set buffer_new) so that it properly zeros out everything.
> But we don't have the buffer mapped yet so we cannot really unmap underlying
> metadata in this state. Make VFS avoid doing unmapping of metadata when the
> buffer is not yet mapped.

Is this a seperate bugfix for delalloc filesystems? What is the error
case of attempting to unmap underlying metadata of non mapped buffer?
Won't translate to a serious bug will it?

> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/buffer.c |   12 +++++++-----
>  1 files changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 80e2630..7eb1710 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1683,8 +1683,9 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
>  			if (buffer_new(bh)) {
>  				/* blockdev mappings never come here */
>  				clear_buffer_new(bh);
> -				unmap_underlying_metadata(bh->b_bdev,
> -							bh->b_blocknr);
> +				if (buffer_mapped(bh))
> +					unmap_underlying_metadata(bh->b_bdev,
> +						bh->b_blocknr);
>  			}
>  		}
>  		bh = bh->b_this_page;
> @@ -1869,8 +1870,9 @@ static int __block_prepare_write(struct inode *inode, struct page *page,
>  			if (err)
>  				break;
>  			if (buffer_new(bh)) {
> -				unmap_underlying_metadata(bh->b_bdev,
> -							bh->b_blocknr);
> +				if (buffer_mapped(bh))
> +					unmap_underlying_metadata(bh->b_bdev,
> +						bh->b_blocknr);
>  				if (PageUptodate(page)) {
>  					clear_buffer_new(bh);
>  					set_buffer_uptodate(bh);
> @@ -2683,7 +2685,7 @@ int nobh_write_begin(struct file *file, struct address_space *mapping,
>  			goto failed;
>  		if (!buffer_mapped(bh))
>  			is_mapped_to_disk = 0;
> -		if (buffer_new(bh))
> +		if (buffer_new(bh) && buffer_mapped(bh))
>  			unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
>  		if (PageUptodate(page)) {
>  			set_buffer_uptodate(bh);
> -- 
> 1.6.0.2

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 01/11] ext3: Get rid of extenddisksize parameter of ext3_get_blocks_handle()
  2009-06-17 10:28   ` Nick Piggin
@ 2009-06-17 11:49     ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2009-06-17 11:49 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Jan Kara, LKML, linux-mm, linux-fsdevel

On Wed 17-06-09 12:28:10, Nick Piggin wrote:
> On Mon, Jun 15, 2009 at 07:59:48PM +0200, Jan Kara wrote:
> > Get rid of extenddisksize parameter of ext3_get_blocks_handle(). This seems to
> > be a relict from some old days and setting disksize in this function does not
> > make much sence. Currently it was set only by ext3_getblk().  Since the
> > parameter has some effect only if create == 1, it is easy to check that the
> > three callers which end up calling ext3_getblk() with create == 1 (ext3_append,
> > ext3_quota_write, ext3_mkdir) do the right thing and set disksize themselves.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> I guess something like this should just go in this merge window if
> ext3 developers are happy with it? There is no real reason to keep
> it with the main patchset is there?
  Yes, I'll merge this separately. It's in the series just because ext3
patches depend on it.

									Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 07/11] vfs: Unmap underlying metadata of new data buffers only when buffer is mapped
  2009-06-17 10:35   ` Nick Piggin
@ 2009-06-17 12:05     ` Jan Kara
  2009-06-17 13:53       ` Nick Piggin
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Kara @ 2009-06-17 12:05 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Jan Kara, LKML, linux-mm, linux-fsdevel

On Wed 17-06-09 12:35:43, Nick Piggin wrote:
> On Mon, Jun 15, 2009 at 07:59:54PM +0200, Jan Kara wrote:
> > When we do delayed allocation of some buffer, we want to signal to VFS that
> > the buffer is new (set buffer_new) so that it properly zeros out everything.
> > But we don't have the buffer mapped yet so we cannot really unmap underlying
> > metadata in this state. Make VFS avoid doing unmapping of metadata when the
> > buffer is not yet mapped.
> 
> Is this a seperate bugfix for delalloc filesystems? What is the error
> case of attempting to unmap underlying metadata of non mapped buffer?
> Won't translate to a serious bug will it?
  If you do unmap_underlying_metadata on !mapped buffer, the kernel will
oops because it will try to dereference bh->b_bdev which is NULL. Ext4 or
XFS workaround this issue by setting b_bdev to the real device and b_blocknr
to ~0 so unmap_underlying_metadata does not oops.  As I didn't want to do
the same hack in ext3, I need this patch...
  You're right it's not directly connected with the mkwrite problem and
can go in separately. Given how late it is, I'd like to get patch number 2
reviewed (generic mkwrite changes), so that it can go together with patch
number 4 (ext4 fixes) in the current merge window. The rest is not that
urgent since it's not oopsable and you can hit it only when running out
of space (or hitting quota limit)...

								Honza

> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/buffer.c |   12 +++++++-----
> >  1 files changed, 7 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index 80e2630..7eb1710 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -1683,8 +1683,9 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
> >  			if (buffer_new(bh)) {
> >  				/* blockdev mappings never come here */
> >  				clear_buffer_new(bh);
> > -				unmap_underlying_metadata(bh->b_bdev,
> > -							bh->b_blocknr);
> > +				if (buffer_mapped(bh))
> > +					unmap_underlying_metadata(bh->b_bdev,
> > +						bh->b_blocknr);
> >  			}
> >  		}
> >  		bh = bh->b_this_page;
> > @@ -1869,8 +1870,9 @@ static int __block_prepare_write(struct inode *inode, struct page *page,
> >  			if (err)
> >  				break;
> >  			if (buffer_new(bh)) {
> > -				unmap_underlying_metadata(bh->b_bdev,
> > -							bh->b_blocknr);
> > +				if (buffer_mapped(bh))
> > +					unmap_underlying_metadata(bh->b_bdev,
> > +						bh->b_blocknr);
> >  				if (PageUptodate(page)) {
> >  					clear_buffer_new(bh);
> >  					set_buffer_uptodate(bh);
> > @@ -2683,7 +2685,7 @@ int nobh_write_begin(struct file *file, struct address_space *mapping,
> >  			goto failed;
> >  		if (!buffer_mapped(bh))
> >  			is_mapped_to_disk = 0;
> > -		if (buffer_new(bh))
> > +		if (buffer_new(bh) && buffer_mapped(bh))
> >  			unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
> >  		if (PageUptodate(page)) {
> >  			set_buffer_uptodate(bh);
> > -- 
> > 1.6.0.2
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 07/11] vfs: Unmap underlying metadata of new data buffers only when buffer is mapped
  2009-06-17 12:05     ` Jan Kara
@ 2009-06-17 13:53       ` Nick Piggin
  2009-06-18 12:00         ` Theodore Tso
  0 siblings, 1 reply; 39+ messages in thread
From: Nick Piggin @ 2009-06-17 13:53 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, linux-mm, linux-fsdevel

On Wed, Jun 17, 2009 at 02:05:20PM +0200, Jan Kara wrote:
> On Wed 17-06-09 12:35:43, Nick Piggin wrote:
> > On Mon, Jun 15, 2009 at 07:59:54PM +0200, Jan Kara wrote:
> > > When we do delayed allocation of some buffer, we want to signal to VFS that
> > > the buffer is new (set buffer_new) so that it properly zeros out everything.
> > > But we don't have the buffer mapped yet so we cannot really unmap underlying
> > > metadata in this state. Make VFS avoid doing unmapping of metadata when the
> > > buffer is not yet mapped.
> > 
> > Is this a seperate bugfix for delalloc filesystems? What is the error
> > case of attempting to unmap underlying metadata of non mapped buffer?
> > Won't translate to a serious bug will it?
>   If you do unmap_underlying_metadata on !mapped buffer, the kernel will
> oops because it will try to dereference bh->b_bdev which is NULL. Ext4 or
> XFS workaround this issue by setting b_bdev to the real device and b_blocknr
> to ~0 so unmap_underlying_metadata does not oops.  As I didn't want to do
> the same hack in ext3, I need this patch...

OK, just trying to understand the patchset. It would be nice to
merge this ASAP as well and remove the ext4 and xfs hacks.


>   You're right it's not directly connected with the mkwrite problem and
> can go in separately. Given how late it is, I'd like to get patch number 2
> reviewed (generic mkwrite changes), so that it can go together with patch
> number 4 (ext4 fixes) in the current merge window. The rest is not that
> urgent since it's not oopsable and you can hit it only when running out
> of space (or hitting quota limit)...

Sorry I was so late with looking at it. I am reading it now though
(especially #2) ;)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 07/11] vfs: Unmap underlying metadata of new data buffers only when buffer is mapped
  2009-06-15 17:59 ` [PATCH 07/11] vfs: Unmap underlying metadata of new data buffers only when buffer is mapped Jan Kara
  2009-06-17 10:35   ` Nick Piggin
@ 2009-06-18 11:51   ` OGAWA Hirofumi
  1 sibling, 0 replies; 39+ messages in thread
From: OGAWA Hirofumi @ 2009-06-18 11:51 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, linux-mm, linux-fsdevel, npiggin

Jan Kara <jack@suse.cz> writes:

> When we do delayed allocation of some buffer, we want to signal to VFS that
> the buffer is new (set buffer_new) so that it properly zeros out everything.
> But we don't have the buffer mapped yet so we cannot really unmap underlying
> metadata in this state. Make VFS avoid doing unmapping of metadata when the
> buffer is not yet mapped.

I may be missing something. However, isn't the delalloc buffer ==
(buffer_mapped() | buffer_delay())? Well, anyway, if buffer is not
buffer_mapped(), e.g. truncate() works properly?

Thanks.

> diff --git a/fs/buffer.c b/fs/buffer.c
> index 80e2630..7eb1710 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1683,8 +1683,9 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
>  			if (buffer_new(bh)) {
>  				/* blockdev mappings never come here */
>  				clear_buffer_new(bh);
> -				unmap_underlying_metadata(bh->b_bdev,
> -							bh->b_blocknr);
> +				if (buffer_mapped(bh))
> +					unmap_underlying_metadata(bh->b_bdev,
> +						bh->b_blocknr);

Is this needed for writepage?

>  			}
>  		}
>  		bh = bh->b_this_page;
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 07/11] vfs: Unmap underlying metadata of new data buffers only when buffer is mapped
  2009-06-17 13:53       ` Nick Piggin
@ 2009-06-18 12:00         ` Theodore Tso
  0 siblings, 0 replies; 39+ messages in thread
From: Theodore Tso @ 2009-06-18 12:00 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Jan Kara, LKML, linux-mm, linux-fsdevel

On Wed, Jun 17, 2009 at 03:53:31PM +0200, Nick Piggin wrote:
> >   You're right it's not directly connected with the mkwrite problem and
> > can go in separately. Given how late it is, I'd like to get patch number 2
> > reviewed (generic mkwrite changes), so that it can go together with patch
> > number 4 (ext4 fixes) in the current merge window. The rest is not that
> > urgent since it's not oopsable and you can hit it only when running out
> > of space (or hitting quota limit)...
> 
> Sorry I was so late with looking at it. I am reading it now though
> (especially #2) ;)

Thanks, as I have additional ext4 patches which are queued up and
blocked behind these patches.  :-)

						 - Ted

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 02/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize
  2009-06-15 17:59 ` [PATCH 02/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize Jan Kara
@ 2009-06-25 16:17   ` Nick Piggin
  2009-06-25 16:43     ` Nick Piggin
                       ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Nick Piggin @ 2009-06-25 16:17 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, linux-mm, linux-fsdevel

On Mon, Jun 15, 2009 at 07:59:49PM +0200, Jan Kara wrote:
> page_mkwrite() is meant to be used by filesystems to allocate blocks under a
> page which is becoming writeably mmapped in some process address space. This
> allows a filesystem to return a page fault if there is not enough space
> available, user exceeds quota or similar problem happens, rather than silently
> discarding data later when writepage is called.
> 
> On filesystems where blocksize < pagesize the situation is more complicated.
> Think for example that blocksize = 1024, pagesize = 4096 and a process does:
>   ftruncate(fd, 0);
>   pwrite(fd, buf, 1024, 0);
>   map = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
>   map[0] = 'a';  ----> page_mkwrite() for index 0 is called
>   ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
>   fsync(fd); ----> writepage() for index 0 is called
> 
> At the moment page_mkwrite() is called, filesystem can allocate only one block
> for the page because i_size == 1024. Otherwise it would create blocks beyond
> i_size which is generally undesirable. But later at writepage() time, we would
> like to have blocks allocated for the whole page (and in principle we have to
> allocate them because user could have filled the page with data after the
> second ftruncate()). This patch introduces a framework which allows filesystems
> to handle this with a reasonable effort.
> 
> The idea is following: Before we extend i_size, we obtain a special lock blocking
> page_mkwrite() on the page straddling i_size. Then we writeprotect the page,
> change i_size and unlock the special lock. This way, page_mkwrite() is called for
> a page each time a number of blocks needed to be allocated for a page increases.


Sorry for late reply here, I'm not sure if the ptach was ready for this
merge window anyway if it has not been in Andrew or Al's trees.

Well... I can't really find any hole in your code, but I'm not completely
sure I like the design. I have done some thinking about the problem
when working on fsblock.

This is adding a whole new layer of synchronisation, which isn't exactly
trivial. What I had been thinking about is doing just page based
synchronisation. Now that page_mkwrite has been changed to allow page
lock held, I think it can work cleanly from the vm/pagecache perspective.

The biggest problem I ran into is the awful structuring of truncate from
below the vfs (so I gave up then).

I have been working to clean up and complete (at least to an RFC stage)
patches to improve this, at which point, doing the page_mkclean thing
on the last partial page should be quite trivial I think.

Basically the problems is that i_op->truncate a) cannot return an error
(which is causing problems missing -EIO today anyway), and b) is called
after i_size update which makes it not possible to error-out without
races anyway, and c) does not get the old i_size so you can't unmap the
last partial page with it.

My patch is basically moving ->truncate call into setattr, and have
the filesystem call vmtruncate. I've jt to clean up loose ends.

Now I may be speaking too soon. It might trun out that my fix is
complex as well, but let me just give you an RFC and we can discuss.

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 02/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize
  2009-06-25 16:17   ` Nick Piggin
@ 2009-06-25 16:43     ` Nick Piggin
  2009-06-25 17:47     ` Christoph Hellwig
  2009-06-26 12:21     ` Jan Kara
  2 siblings, 0 replies; 39+ messages in thread
From: Nick Piggin @ 2009-06-25 16:43 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, linux-mm, linux-fsdevel

On Thu, Jun 25, 2009 at 06:17:53PM +0200, Nick Piggin wrote:
> On Mon, Jun 15, 2009 at 07:59:49PM +0200, Jan Kara wrote:
> My patch is basically moving ->truncate call into setattr, and have
> the filesystem call vmtruncate. I've jt to clean up loose ends.
> 
> Now I may be speaking too soon. It might trun out that my fix is
> complex as well, but let me just give you an RFC and we can discuss.

Well this is the basics of the truncate rearrangement, obviously
not finished but it should give the fs enough flexiblity to DTRT
to fix the mapped partial page problem I think.

I'll followup with a patch for that if I can make it work.

---
 fs/attr.c          |   47 +++++++++++++++++++++++++++++++++++++++++------
 fs/buffer.c        |    6 ++++++
 fs/ext2/ext2.h     |    2 +-
 fs/ext2/inode.c    |   38 +++++++++++++++++++++++---------------
 fs/libfs.c         |   13 +++++++++++++
 include/linux/fs.h |    4 +++-
 include/linux/mm.h |    2 +-
 mm/filemap_xip.c   |    9 ++++++---
 mm/memory.c        |   25 +------------------------
 mm/shmem.c         |   12 +++++++++---
 10 files changed, 104 insertions(+), 54 deletions(-)

Index: linux-2.6/fs/attr.c
===================================================================
--- linux-2.6.orig/fs/attr.c
+++ linux-2.6/fs/attr.c
@@ -60,18 +60,53 @@ fine:
 error:
 	return retval;
 }
-
 EXPORT_SYMBOL(inode_change_ok);
 
+int inode_truncate_ok(struct inode *inode, loff_t offset)
+{
+	if (inode->i_size < offset) {
+		unsigned long limit;
+
+		limit = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
+		if (limit != RLIM_INFINITY && offset > limit)
+			goto out_sig;
+		if (offset > inode->i_sb->s_maxbytes)
+			goto out_big;
+	} else {
+		/*
+		 * truncation of in-use swapfiles is disallowed - it would
+		 * cause subsequent swapout to scribble on the now-freed
+		 * blocks.
+		 */
+		if (IS_SWAPFILE(inode))
+			return -ETXTBSY;
+	}
+
+	return 0;
+out_sig:
+	send_sig(SIGXFSZ, current, 0);
+out_big:
+	return -EFBIG;
+}
+EXPORT_SYMBOL(inode_truncate_ok);
+
 int inode_setattr(struct inode * inode, struct iattr * attr)
 {
 	unsigned int ia_valid = attr->ia_valid;
 
-	if (ia_valid & ATTR_SIZE &&
-	    attr->ia_size != i_size_read(inode)) {
-		int error = vmtruncate(inode, attr->ia_size);
-		if (error)
-			return error;
+	if (ia_valid & ATTR_SIZE) {
+		loff_t offset = attr->ia_size;
+
+		if (offset != inode->i_size) {
+			int error;
+
+			if (inode->i_op->truncate)
+				error = inode->i_op->truncate(inode, offset);
+			else
+				error = simple_truncate(inode, offset);
+			if (error)
+				return error;
+		}
 	}
 
 	if (ia_valid & ATTR_UID)
Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c
+++ linux-2.6/fs/ext2/inode.c
@@ -68,7 +68,7 @@ void ext2_delete_inode (struct inode * i
 
 	inode->i_size = 0;
 	if (inode->i_blocks)
-		ext2_truncate (inode);
+		ext2_truncate(inode, 0);
 	ext2_free_inode (inode);
 
 	return;
@@ -1020,7 +1020,7 @@ static void ext2_free_branches(struct in
 		ext2_free_data(inode, p, q);
 }
 
-void ext2_truncate(struct inode *inode)
+int ext2_truncate(struct inode *inode, loff_t offset)
 {
 	__le32 *i_data = EXT2_I(inode)->i_data;
 	struct ext2_inode_info *ei = EXT2_I(inode);
@@ -1032,31 +1032,37 @@ void ext2_truncate(struct inode *inode)
 	int n;
 	long iblock;
 	unsigned blocksize;
+	int error;
+
+	error = inode_truncate_ok(inode, offset);
+	if (error)
+		return error;
 
 	if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
 	    S_ISLNK(inode->i_mode)))
-		return;
+		return -EINVAL;
 	if (ext2_inode_is_fast_symlink(inode))
-		return;
+		return -EINVAL;
 	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
-		return;
-
-	blocksize = inode->i_sb->s_blocksize;
-	iblock = (inode->i_size + blocksize-1)
-					>> EXT2_BLOCK_SIZE_BITS(inode->i_sb);
+		return -EPERM;
 
 	if (mapping_is_xip(inode->i_mapping))
-		xip_truncate_page(inode->i_mapping, inode->i_size);
+		error = xip_truncate_page(inode->i_mapping, offset);
 	else if (test_opt(inode->i_sb, NOBH))
-		nobh_truncate_page(inode->i_mapping,
-				inode->i_size, ext2_get_block);
+		error = nobh_truncate_page(inode->i_mapping,
+				offset, ext2_get_block);
 	else
-		block_truncate_page(inode->i_mapping,
-				inode->i_size, ext2_get_block);
+		error = block_truncate_page(inode->i_mapping,
+				offset, ext2_get_block);
+	if (error)
+		return error;
+
+	blocksize = inode->i_sb->s_blocksize;
+	iblock = (offset + blocksize-1) >> EXT2_BLOCK_SIZE_BITS(inode->i_sb);
 
 	n = ext2_block_to_path(inode, iblock, offsets, NULL);
 	if (n == 0)
-		return;
+		return 0;
 
 	/*
 	 * From here we block out all ext2_get_block() callers who want to
@@ -1127,6 +1133,8 @@ do_indirects:
 	} else {
 		mark_inode_dirty(inode);
 	}
+
+	return 0;
 }
 
 static struct ext2_inode *ext2_get_inode(struct super_block *sb, ino_t ino,
Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -329,6 +329,18 @@ int simple_rename(struct inode *old_dir,
 	return 0;
 }
 
+int simple_truncate(struct inode * inode, loff_t offset)
+{
+	int error;
+
+	error = inode_truncate_ok(inode, offset);
+	if (error)
+		return error;
+	vmtruncate(inode, offset);
+
+	return error;
+}
+
 int simple_readpage(struct file *file, struct page *page)
 {
 	clear_highpage(page);
@@ -840,6 +852,7 @@ EXPORT_SYMBOL(generic_read_dir);
 EXPORT_SYMBOL(get_sb_pseudo);
 EXPORT_SYMBOL(simple_write_begin);
 EXPORT_SYMBOL(simple_write_end);
+EXPORT_SYMBOL(simple_truncate);
 EXPORT_SYMBOL(simple_dir_inode_operations);
 EXPORT_SYMBOL(simple_dir_operations);
 EXPORT_SYMBOL(simple_empty);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -1526,7 +1526,7 @@ struct inode_operations {
 	int (*readlink) (struct dentry *, char __user *,int);
 	void * (*follow_link) (struct dentry *, struct nameidata *);
 	void (*put_link) (struct dentry *, struct nameidata *, void *);
-	void (*truncate) (struct inode *);
+	int (*truncate) (struct inode *, loff_t);
 	int (*permission) (struct inode *, int);
 	int (*setattr) (struct dentry *, struct iattr *);
 	int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
@@ -2332,6 +2332,7 @@ extern int simple_link(struct dentry *,
 extern int simple_unlink(struct inode *, struct dentry *);
 extern int simple_rmdir(struct inode *, struct dentry *);
 extern int simple_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
+extern int simple_truncate(struct inode *inode, loff_t offset);
 extern int simple_sync_file(struct file *, struct dentry *, int);
 extern int simple_empty(struct dentry *);
 extern int simple_readpage(struct file *file, struct page *page);
@@ -2367,6 +2368,7 @@ extern int buffer_migrate_page(struct ad
 #endif
 
 extern int inode_change_ok(struct inode *, struct iattr *);
+extern int inode_truncate_ok(struct inode *, loff_t offset);
 extern int __must_check inode_setattr(struct inode *, struct iattr *);
 
 extern void file_update_time(struct file *file);
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -805,7 +805,7 @@ static inline void unmap_shared_mapping_
 	unmap_mapping_range(mapping, holebegin, holelen, 0);
 }
 
-extern int vmtruncate(struct inode * inode, loff_t offset);
+extern void vmtruncate(struct inode * inode, loff_t offset);
 extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end);
 
 #ifdef CONFIG_MMU
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -2420,27 +2420,13 @@ EXPORT_SYMBOL(unmap_mapping_range);
  * between the file and the memory map for a potential last
  * incomplete page.  Ugly, but necessary.
  */
-int vmtruncate(struct inode * inode, loff_t offset)
+void vmtruncate(struct inode * inode, loff_t offset)
 {
 	if (inode->i_size < offset) {
-		unsigned long limit;
-
-		limit = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
-		if (limit != RLIM_INFINITY && offset > limit)
-			goto out_sig;
-		if (offset > inode->i_sb->s_maxbytes)
-			goto out_big;
 		i_size_write(inode, offset);
 	} else {
 		struct address_space *mapping = inode->i_mapping;
 
-		/*
-		 * truncation of in-use swapfiles is disallowed - it would
-		 * cause subsequent swapout to scribble on the now-freed
-		 * blocks.
-		 */
-		if (IS_SWAPFILE(inode))
-			return -ETXTBSY;
 		i_size_write(inode, offset);
 
 		/*
@@ -2456,15 +2442,6 @@ int vmtruncate(struct inode * inode, lof
 		truncate_inode_pages(mapping, offset);
 		unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
 	}
-
-	if (inode->i_op->truncate)
-		inode->i_op->truncate(inode);
-	return 0;
-
-out_sig:
-	send_sig(SIGXFSZ, current, 0);
-out_big:
-	return -EFBIG;
 }
 EXPORT_SYMBOL(vmtruncate);
 
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -763,9 +763,15 @@ done2:
 	}
 }
 
-static void shmem_truncate(struct inode *inode)
+static int shmem_truncate(struct inode *inode, loff_t offset)
 {
-	shmem_truncate_range(inode, inode->i_size, (loff_t)-1);
+	int error;
+	error = inode_truncate_ok(inode, offset);
+	if (error)
+		return error;
+	vmtruncate(inode, offset);
+	shmem_truncate_range(inode, offset, (loff_t)-1);
+	return error;
 }
 
 static int shmem_notify_change(struct dentry *dentry, struct iattr *attr)
@@ -826,7 +832,7 @@ static void shmem_delete_inode(struct in
 		truncate_inode_pages(inode->i_mapping, 0);
 		shmem_unacct_size(info->flags, inode->i_size);
 		inode->i_size = 0;
-		shmem_truncate(inode);
+		shmem_truncate(inode, 0);
 		if (!list_empty(&info->swaplist)) {
 			mutex_lock(&shmem_swaplist_mutex);
 			list_del_init(&info->swaplist);
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2768,6 +2768,9 @@ unlock:
 	unlock_page(page);
 	page_cache_release(page);
 out:
+	if (!err)
+		vmtruncate(inode, from);
+
 	return err;
 }
 EXPORT_SYMBOL(nobh_truncate_page);
@@ -2844,6 +2847,9 @@ unlock:
 	unlock_page(page);
 	page_cache_release(page);
 out:
+	if (!err)
+		vmtruncate(inode, from);
+
 	return err;
 }
 
Index: linux-2.6/fs/ext2/ext2.h
===================================================================
--- linux-2.6.orig/fs/ext2/ext2.h
+++ linux-2.6/fs/ext2/ext2.h
@@ -122,7 +122,7 @@ extern int ext2_write_inode (struct inod
 extern void ext2_delete_inode (struct inode *);
 extern int ext2_sync_inode (struct inode *);
 extern int ext2_get_block(struct inode *, sector_t, struct buffer_head *, int);
-extern void ext2_truncate (struct inode *);
+extern int ext2_truncate (struct inode *, loff_t);
 extern int ext2_setattr (struct dentry *, struct iattr *);
 extern void ext2_set_inode_flags(struct inode *inode);
 extern void ext2_get_inode_flags(struct ext2_inode_info *);
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -440,6 +440,7 @@ EXPORT_SYMBOL_GPL(xip_file_write);
 int
 xip_truncate_page(struct address_space *mapping, loff_t from)
 {
+	struct inode *inode = mapping->host;
 	pgoff_t index = from >> PAGE_CACHE_SHIFT;
 	unsigned offset = from & (PAGE_CACHE_SIZE-1);
 	unsigned blocksize;
@@ -450,12 +451,12 @@ xip_truncate_page(struct address_space *
 
 	BUG_ON(!mapping->a_ops->get_xip_mem);
 
-	blocksize = 1 << mapping->host->i_blkbits;
+	blocksize = 1 << inode->i_blkbits;
 	length = offset & (blocksize - 1);
 
 	/* Block boundary? Nothing to do */
 	if (!length)
-		return 0;
+		goto out;
 
 	length = blocksize - length;
 
@@ -464,11 +465,13 @@ xip_truncate_page(struct address_space *
 	if (unlikely(err)) {
 		if (err == -ENODATA)
 			/* Hole? No need to truncate */
-			return 0;
+			goto out;
 		else
 			return err;
 	}
 	memset(xip_mem + offset, 0, length);
+out:
+	vmtruncate(host, from);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(xip_truncate_page);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 02/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize
  2009-06-25 16:17   ` Nick Piggin
  2009-06-25 16:43     ` Nick Piggin
@ 2009-06-25 17:47     ` Christoph Hellwig
  2009-06-26  8:42       ` Nick Piggin
  2009-06-26 12:21     ` Jan Kara
  2 siblings, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2009-06-25 17:47 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Jan Kara, LKML, linux-mm, linux-fsdevel

On Thu, Jun 25, 2009 at 06:17:53PM +0200, Nick Piggin wrote:
> Basically the problems is that i_op->truncate a) cannot return an error
> (which is causing problems missing -EIO today anyway), and b) is called
> after i_size update which makes it not possible to error-out without
> races anyway, and c) does not get the old i_size so you can't unmap the
> last partial page with it.
> 
> My patch is basically moving ->truncate call into setattr, and have
> the filesystem call vmtruncate. I've jt to clean up loose ends.

We absolutely need to get rid of ->truncate.  Due to the above issues
XFS already does the majority of the truncate work from setattr, and it
works pretty well.  The only problem is the generic aops calling
vmtruncate directly.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 02/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize
  2009-06-25 17:47     ` Christoph Hellwig
@ 2009-06-26  8:42       ` Nick Piggin
  2009-06-30 17:37         ` Christoph Hellwig
  0 siblings, 1 reply; 39+ messages in thread
From: Nick Piggin @ 2009-06-26  8:42 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, LKML, linux-mm, linux-fsdevel

On Thu, Jun 25, 2009 at 01:47:55PM -0400, Christoph Hellwig wrote:
> On Thu, Jun 25, 2009 at 06:17:53PM +0200, Nick Piggin wrote:
> > Basically the problems is that i_op->truncate a) cannot return an error
> > (which is causing problems missing -EIO today anyway), and b) is called
> > after i_size update which makes it not possible to error-out without
> > races anyway, and c) does not get the old i_size so you can't unmap the
> > last partial page with it.
> > 
> > My patch is basically moving ->truncate call into setattr, and have
> > the filesystem call vmtruncate. I've jt to clean up loose ends.
> 
> We absolutely need to get rid of ->truncate.  Due to the above issues
> XFS already does the majority of the truncate work from setattr, and it
> works pretty well.

Yes well we could get rid of ->truncate and have filesystems do it
themselves in setattr, but I figure that moving truncate into
generic setattr is helpful (makes conversions a bit easier too).
Did you see my patch? What do you think of that basic approach?


>  The only problem is the generic aops calling
> vmtruncate directly.

What should be done is require that filesystems trim blocks past
i_size in case of any errors. I actually need to fix up a few
existing bugs in this area too, so I'll look at this..

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 02/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize
  2009-06-25 16:17   ` Nick Piggin
  2009-06-25 16:43     ` Nick Piggin
  2009-06-25 17:47     ` Christoph Hellwig
@ 2009-06-26 12:21     ` Jan Kara
  2009-06-26 12:55       ` Nick Piggin
  2 siblings, 1 reply; 39+ messages in thread
From: Jan Kara @ 2009-06-26 12:21 UTC (permalink / raw)
  To: Nick Piggin; +Cc: LKML, linux-mm, linux-fsdevel

  Hi Nick,

On Thu 25-06-09 18:17:53, Nick Piggin wrote:
> On Mon, Jun 15, 2009 at 07:59:49PM +0200, Jan Kara wrote:
> > page_mkwrite() is meant to be used by filesystems to allocate blocks under a
> > page which is becoming writeably mmapped in some process address space. This
> > allows a filesystem to return a page fault if there is not enough space
> > available, user exceeds quota or similar problem happens, rather than silently
> > discarding data later when writepage is called.
> > 
> > On filesystems where blocksize < pagesize the situation is more complicated.
> > Think for example that blocksize = 1024, pagesize = 4096 and a process does:
> >   ftruncate(fd, 0);
> >   pwrite(fd, buf, 1024, 0);
> >   map = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
> >   map[0] = 'a';  ----> page_mkwrite() for index 0 is called
> >   ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
> >   fsync(fd); ----> writepage() for index 0 is called
> > 
> > At the moment page_mkwrite() is called, filesystem can allocate only one block
> > for the page because i_size == 1024. Otherwise it would create blocks beyond
> > i_size which is generally undesirable. But later at writepage() time, we would
> > like to have blocks allocated for the whole page (and in principle we have to
> > allocate them because user could have filled the page with data after the
> > second ftruncate()). This patch introduces a framework which allows filesystems
> > to handle this with a reasonable effort.
> > 
> > The idea is following: Before we extend i_size, we obtain a special lock blocking
> > page_mkwrite() on the page straddling i_size. Then we writeprotect the page,
> > change i_size and unlock the special lock. This way, page_mkwrite() is called for
> > a page each time a number of blocks needed to be allocated for a page increases.
> 
> 
> Sorry for late reply here, I'm not sure if the ptach was ready for this
> merge window anyway if it has not been in Andrew or Al's trees.
> 
> Well... I can't really find any hole in your code, but I'm not completely
> sure I like the design. I have done some thinking about the problem
> when working on fsblock.
> 
> This is adding a whole new layer of synchronisation, which isn't exactly
> trivial. What I had been thinking about is doing just page based
> synchronisation. Now that page_mkwrite has been changed to allow page
> lock held, I think it can work cleanly from the vm/pagecache perspective.
  Yes, I agree the solution isn't elegant. But I couldn't come up with
anything better ;).

> The biggest problem I ran into is the awful structuring of truncate from
> below the vfs (so I gave up then).
> 
> I have been working to clean up and complete (at least to an RFC stage)
> patches to improve this, at which point, doing the page_mkclean thing
> on the last partial page should be quite trivial I think.
> 
> Basically the problems is that i_op->truncate a) cannot return an error
> (which is causing problems missing -EIO today anyway), and b) is called
> after i_size update which makes it not possible to error-out without
> races anyway, and c) does not get the old i_size so you can't unmap the
> last partial page with it.
> 
> My patch is basically moving ->truncate call into setattr, and have
> the filesystem call vmtruncate. I've jt to clean up loose ends.
> 
> Now I may be speaking too soon. It might trun out that my fix is
> complex as well, but let me just give you an RFC and we can discuss.
  I've looked at your patch and it's definitely a worthwhile cleanup.
But it's not quite enough for what I need. I'll try to describe the issue
as I'm aware of it and possible solutions and maybe you'll have some idea
about a better solution.
  PROBLEM: We have a file 'f' of length OLD_ISIZE mmapped upto some offset
OFF (multiple of pagesize). The problem generally happens when OLD_ISIZE <
OFF and subsequent filesystem operation (it need not be just truncate() but
also a plain write() creating a hole - this is the main reason why the
patch is much more complicated than I'd like) increases file size to
NEW_ISIZE.
  The first decision: mmap() documentation says: "The effect of changing
the size of the underlying file of a mapping on the pages that correspond
to added or removed regions of the file is  unspecified." So according to
this it would be perfectly fine if we just discarded all the data beyond
OLD_ISIZE written via that particular mmap(). It would be even technically
doable - vma would store minimum i_size which was ever seen at page_mkwrite
time, page_mkwrite will allocate buffers only upto vma->i_size, we silently
discard all unmapped dirty buffers. But I also see two problems with this:
  1) So far we were much nicer to the user and when the file size has been
increased, user could happily write data via old mmap upto new file size.
I'd almost bet some people rely on this especially in the case
truncate-down, truncate-up, write via mmap.
  2) It's kind of fragile to discard dirty unmapped buffers without a
warning.

  So I took the decision that data written via mmap() should be stored
on disk properly if they were written inside i_size at the time the write
via mmap() happened. This decision basically dictates, that you have to do
some magic for the page containing OLD_ISIZE at the time file size is going
to be extended. All kinds of magic I could think of required taking
PageLock of the page containing OLD_ISIZE and that makes locking for the
write case interesting:
  1) write creating a hole has to update i_size inside PageLock for the
page it writes to (to avoid races with block_write_full_page()).
  2) we have to take PageLock for the page containing OLD_ISIZE sometime
before i_size gets updated to do our magic -> this has to happen in
write_begin().

  Now about the magic: There are two things I could imagine we do:
a) when the page containing OLD_ISIZE is dirty, try to allocate blocks
under it as needed for NEW_ISIZE - but the errors when this fails will
return to the process doing truncate / creating hole with write which is
kind of strange. Also we maybe allocate blocks unnecessarily because user
may never actually write to the page containing OLD_ISIZE again...
b) we writeout the page, writeprotect it and let page_mkwrite() do it's
work when it's called again. This is nice from the theoretical POV but
gets a bit messy - we have to make sure page_mkwrite() for that page
doesn't proceed until we update the i_size. Which is why I introduced that
inode bit-lock. We cannot use e.g. i_mutex for that because it ranks above
mmap_sem which is held when we enter page_mkwrite.

  So if you have any idea how to better solve this, you are welcome ;).

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 02/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize
  2009-06-26 12:21     ` Jan Kara
@ 2009-06-26 12:55       ` Nick Piggin
  2009-06-26 16:08         ` Jan Kara
  0 siblings, 1 reply; 39+ messages in thread
From: Nick Piggin @ 2009-06-26 12:55 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, linux-mm, linux-fsdevel

On Fri, Jun 26, 2009 at 02:21:41PM +0200, Jan Kara wrote:
> > Now I may be speaking too soon. It might trun out that my fix is
> > complex as well, but let me just give you an RFC and we can discuss.
>   I've looked at your patch and it's definitely a worthwhile cleanup.
> But it's not quite enough for what I need. I'll try to describe the issue
> as I'm aware of it and possible solutions and maybe you'll have some idea
> about a better solution.
>   PROBLEM: We have a file 'f' of length OLD_ISIZE mmapped upto some offset
> OFF (multiple of pagesize). The problem generally happens when OLD_ISIZE <
> OFF and subsequent filesystem operation (it need not be just truncate() but
> also a plain write() creating a hole - this is the main reason why the
> patch is much more complicated than I'd like) increases file size to
> NEW_ISIZE.
>   The first decision: mmap() documentation says: "The effect of changing
> the size of the underlying file of a mapping on the pages that correspond
> to added or removed regions of the file is  unspecified." So according to
> this it would be perfectly fine if we just discarded all the data beyond
> OLD_ISIZE written via that particular mmap(). It would be even technically
> doable - vma would store minimum i_size which was ever seen at page_mkwrite
> time, page_mkwrite will allocate buffers only upto vma->i_size, we silently
> discard all unmapped dirty buffers. But I also see two problems with this:
>   1) So far we were much nicer to the user and when the file size has been
> increased, user could happily write data via old mmap upto new file size.
> I'd almost bet some people rely on this especially in the case
> truncate-down, truncate-up, write via mmap.
>   2) It's kind of fragile to discard dirty unmapped buffers without a
> warning.
> 
>   So I took the decision that data written via mmap() should be stored
> on disk properly if they were written inside i_size at the time the write
> via mmap() happened. This decision basically dictates, that you have to do
> some magic for the page containing OLD_ISIZE at the time file size is going
> to be extended. All kinds of magic I could think of required taking
> PageLock of the page containing OLD_ISIZE and that makes locking for the
> write case interesting:
>   1) write creating a hole has to update i_size inside PageLock for the
> page it writes to (to avoid races with block_write_full_page()).
>   2) we have to take PageLock for the page containing OLD_ISIZE sometime
> before i_size gets updated to do our magic -> this has to happen in
> write_begin().
> 
>   Now about the magic: There are two things I could imagine we do:
> a) when the page containing OLD_ISIZE is dirty, try to allocate blocks
> under it as needed for NEW_ISIZE - but the errors when this fails will
> return to the process doing truncate / creating hole with write which is
> kind of strange. Also we maybe allocate blocks unnecessarily because user
> may never actually write to the page containing OLD_ISIZE again...
> b) we writeout the page, writeprotect it and let page_mkwrite() do it's
> work when it's called again. This is nice from the theoretical POV but
> gets a bit messy - we have to make sure page_mkwrite() for that page
> doesn't proceed until we update the i_size. Which is why I introduced that
> inode bit-lock. We cannot use e.g. i_mutex for that because it ranks above
> mmap_sem which is held when we enter page_mkwrite.
> 
>   So if you have any idea how to better solve this, you are welcome ;).

Ah thanks, the write(2) case I missed. That does get complex to
do with the page lock.

I agree with the semantics you are aiming for, and I agree we should
not try to allocate blocks when extending i_size.

We actually could update i_size after dropping the page lock in
these paths. That would give a window where we can page_mkclean
the old partial page before the i_size update.

However this does actually require that we remove the partial-page
zeroing that writepage does. I think it does it in order to attempt
to write zeroes into the fs even if the app does mmaped writes
past i_size... but it is pretty dumb anyway really because the
behaviour is undefined anyway so there is no problem if weird
stuff gets written there (it should be zeroed out when extending
the file anyway), and also there is nothing to prevent races of
subsequent mmapped writes before the DMA completes.

But once we remove that, then we can solve it with the page lock
I think.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 02/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize
  2009-06-26 12:55       ` Nick Piggin
@ 2009-06-26 16:08         ` Jan Kara
  2009-06-29  5:54           ` Nick Piggin
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Kara @ 2009-06-26 16:08 UTC (permalink / raw)
  To: Nick Piggin; +Cc: LKML, linux-mm, linux-fsdevel

On Fri 26-06-09 14:55:05, Nick Piggin wrote:
> On Fri, Jun 26, 2009 at 02:21:41PM +0200, Jan Kara wrote:
> > > Now I may be speaking too soon. It might trun out that my fix is
> > > complex as well, but let me just give you an RFC and we can discuss.
> >   I've looked at your patch and it's definitely a worthwhile cleanup.
> > But it's not quite enough for what I need. I'll try to describe the issue
> > as I'm aware of it and possible solutions and maybe you'll have some idea
> > about a better solution.
> >   PROBLEM: We have a file 'f' of length OLD_ISIZE mmapped upto some offset
> > OFF (multiple of pagesize). The problem generally happens when OLD_ISIZE <
> > OFF and subsequent filesystem operation (it need not be just truncate() but
> > also a plain write() creating a hole - this is the main reason why the
> > patch is much more complicated than I'd like) increases file size to
> > NEW_ISIZE.
> >   The first decision: mmap() documentation says: "The effect of changing
> > the size of the underlying file of a mapping on the pages that correspond
> > to added or removed regions of the file is  unspecified." So according to
> > this it would be perfectly fine if we just discarded all the data beyond
> > OLD_ISIZE written via that particular mmap(). It would be even technically
> > doable - vma would store minimum i_size which was ever seen at page_mkwrite
> > time, page_mkwrite will allocate buffers only upto vma->i_size, we silently
> > discard all unmapped dirty buffers. But I also see two problems with this:
> >   1) So far we were much nicer to the user and when the file size has been
> > increased, user could happily write data via old mmap upto new file size.
> > I'd almost bet some people rely on this especially in the case
> > truncate-down, truncate-up, write via mmap.
> >   2) It's kind of fragile to discard dirty unmapped buffers without a
> > warning.
> > 
> >   So I took the decision that data written via mmap() should be stored
> > on disk properly if they were written inside i_size at the time the write
> > via mmap() happened. This decision basically dictates, that you have to do
> > some magic for the page containing OLD_ISIZE at the time file size is going
> > to be extended. All kinds of magic I could think of required taking
> > PageLock of the page containing OLD_ISIZE and that makes locking for the
> > write case interesting:
> >   1) write creating a hole has to update i_size inside PageLock for the
> > page it writes to (to avoid races with block_write_full_page()).
> >   2) we have to take PageLock for the page containing OLD_ISIZE sometime
> > before i_size gets updated to do our magic -> this has to happen in
> > write_begin().
> > 
> >   Now about the magic: There are two things I could imagine we do:
> > a) when the page containing OLD_ISIZE is dirty, try to allocate blocks
> > under it as needed for NEW_ISIZE - but the errors when this fails will
> > return to the process doing truncate / creating hole with write which is
> > kind of strange. Also we maybe allocate blocks unnecessarily because user
> > may never actually write to the page containing OLD_ISIZE again...
> > b) we writeout the page, writeprotect it and let page_mkwrite() do it's
> > work when it's called again. This is nice from the theoretical POV but
> > gets a bit messy - we have to make sure page_mkwrite() for that page
> > doesn't proceed until we update the i_size. Which is why I introduced that
> > inode bit-lock. We cannot use e.g. i_mutex for that because it ranks above
> > mmap_sem which is held when we enter page_mkwrite.
> > 
> >   So if you have any idea how to better solve this, you are welcome ;).
> 
> Ah thanks, the write(2) case I missed. That does get complex to
> do with the page lock.
> 
> I agree with the semantics you are aiming for, and I agree we should
> not try to allocate blocks when extending i_size.
> 
> We actually could update i_size after dropping the page lock in
> these paths. That would give a window where we can page_mkclean
> the old partial page before the i_size update.
  Yes, that would be fine and make things simpler...

> However this does actually require that we remove the partial-page
> zeroing that writepage does. I think it does it in order to attempt
> to write zeroes into the fs even if the app does mmaped writes
> past i_size... but it is pretty dumb anyway really because the
> behaviour is undefined anyway so there is no problem if weird
> stuff gets written there (it should be zeroed out when extending
> the file anyway), and also there is nothing to prevent races of
> subsequent mmapped writes before the DMA completes.
  We definitely don't zero out the last page when extending the file. But
if we do it, we should be fine as you write. I'll try to write a patch...
(I'm on vacation next week though so probably after that).

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 02/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize
  2009-06-26 16:08         ` Jan Kara
@ 2009-06-29  5:54           ` Nick Piggin
  0 siblings, 0 replies; 39+ messages in thread
From: Nick Piggin @ 2009-06-29  5:54 UTC (permalink / raw)
  To: Jan Kara, OM; +Cc: LKML, linux-mm, linux-fsdevel

On Fri, Jun 26, 2009 at 06:08:51PM +0200, Jan Kara wrote:
> On Fri 26-06-09 14:55:05, Nick Piggin wrote:
> > On Fri, Jun 26, 2009 at 02:21:41PM +0200, Jan Kara wrote:
> > >   So if you have any idea how to better solve this, you are welcome ;).
> > 
> > Ah thanks, the write(2) case I missed. That does get complex to
> > do with the page lock.
> > 
> > I agree with the semantics you are aiming for, and I agree we should
> > not try to allocate blocks when extending i_size.
> > 
> > We actually could update i_size after dropping the page lock in
> > these paths. That would give a window where we can page_mkclean
> > the old partial page before the i_size update.
>   Yes, that would be fine and make things simpler...

Hopefully.

 
> > However this does actually require that we remove the partial-page
> > zeroing that writepage does. I think it does it in order to attempt
> > to write zeroes into the fs even if the app does mmaped writes
> > past i_size... but it is pretty dumb anyway really because the
> > behaviour is undefined anyway so there is no problem if weird
> > stuff gets written there (it should be zeroed out when extending
> > the file anyway), and also there is nothing to prevent races of
> > subsequent mmapped writes before the DMA completes.
>   We definitely don't zero out the last page when extending the file. But
> if we do it, we should be fine as you write. I'll try to write a patch...
> (I'm on vacation next week though so probably after that).

What I mean is that as of today, write(2) is required to hold page lock
of the page it is operating on if it writes anything past i_size. It
must hold that lock until i_size is extended to include the new data.
If it does not hold the lock, then eg. block_write_full_page can zero
out that data incorrectly

        /*
         * The page straddles i_size.  It must be zeroed out on each and every
         * writepage invokation because it may be mmapped.  "A file is mapped
         * in multiples of the page size.  For a file that is not a multiple of
         * the  page size, the remaining memory is zeroed when mapped, and
         * writes to that region are not written out to the file."
         */

But I argue this is bogus anyway because it is completely racy, and
it should be undefined behaviour anyway. So I think it would be fine
to remove it.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 02/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize
  2009-06-26  8:42       ` Nick Piggin
@ 2009-06-30 17:37         ` Christoph Hellwig
  2009-07-02  7:22           ` Nick Piggin
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2009-06-30 17:37 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Hellwig, Jan Kara, LKML, linux-mm, linux-fsdevel

On Fri, Jun 26, 2009 at 10:42:25AM +0200, Nick Piggin wrote:
> Yes well we could get rid of ->truncate and have filesystems do it
> themselves in setattr, but I figure that moving truncate into
> generic setattr is helpful (makes conversions a bit easier too).
> Did you see my patch? What do you think of that basic approach?

I was waiting for a patch series for your for this to appear, but
noticed that you actually had a small proof of concept patch attached,
sorry :)

Looking at your patch I really like that vmtruncate now really just
does what it's name claims to - truncate the VM-information about
the file (well, and the file size).   I'm not so happy about
still keeping the two level setattr/truncate indirection.

But instead of folding truncate into setattr I wonder if we should
just add a new ->setsize (aka new trunacte) methodas a top-level
entry point instead of ->setattr with ATTR_SIZE given that size
changes don't have much in common with the reset of ->setattr.

The only bit shared is updating c/mtime and even that is conditional.
So I'd say take most of your patch, but instead of doing an all at
once migration migrate filesystems to the new ->setsize callback
incrementally and eventually kill off the old code.  This means
we'll need a new name for the new vmtruncate-lite but should otherwise
be pretty easy.

> >  The only problem is the generic aops calling
> > vmtruncate directly.
> 
> What should be done is require that filesystems trim blocks past
> i_size in case of any errors. I actually need to fix up a few
> existing bugs in this area too, so I'll look at this..

Basically we want ->setattr with ATTR_SIZE, execept that we already
have i_sem and possibly other per-inode locks.  Take a look at

	http://oss.sgi.com/archives/xfs/2008-04/msg00542.html

and

	http://oss.sgi.com/archives/xfs/2009-03/msg00214.html

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3)
  2009-06-16 14:42   ` Jan Kara
@ 2009-06-30 17:44     ` Christoph Hellwig
  2009-07-01 10:29       ` Aneesh Kumar K.V
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2009-06-30 17:44 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christoph Hellwig, LKML, linux-mm, linux-fsdevel, npiggin

On Tue, Jun 16, 2009 at 04:42:17PM +0200, Jan Kara wrote:
>   Good point, I should have mentioned in the changelog: fsx-linux is able
> to trigger the problem quite quickly.
>   I have also written a simple program for initial testing of the fix
> (works only for 1K blocksize and 4K pagesize) - it's attached.

I haven't been able to trigger anything with it on either xfs or ext4.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3)
  2009-06-30 17:44     ` Christoph Hellwig
@ 2009-07-01 10:29       ` Aneesh Kumar K.V
  0 siblings, 0 replies; 39+ messages in thread
From: Aneesh Kumar K.V @ 2009-07-01 10:29 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, LKML, linux-mm, linux-fsdevel, npiggin

On Tue, Jun 30, 2009 at 01:44:19PM -0400, Christoph Hellwig wrote:
> On Tue, Jun 16, 2009 at 04:42:17PM +0200, Jan Kara wrote:
> >   Good point, I should have mentioned in the changelog: fsx-linux is able
> > to trigger the problem quite quickly.
> >   I have also written a simple program for initial testing of the fix
> > (works only for 1K blocksize and 4K pagesize) - it's attached.
> 
> I haven't been able to trigger anything with it on either xfs or ext4.
> 

fsx-linux with ext4 mount option -o nodelalloc.

-aneesh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 02/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize
  2009-06-30 17:37         ` Christoph Hellwig
@ 2009-07-02  7:22           ` Nick Piggin
  2009-07-04 15:18             ` Christoph Hellwig
  0 siblings, 1 reply; 39+ messages in thread
From: Nick Piggin @ 2009-07-02  7:22 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, LKML, linux-mm, linux-fsdevel

On Tue, Jun 30, 2009 at 01:37:17PM -0400, Christoph Hellwig wrote:
> On Fri, Jun 26, 2009 at 10:42:25AM +0200, Nick Piggin wrote:
> > Yes well we could get rid of ->truncate and have filesystems do it
> > themselves in setattr, but I figure that moving truncate into
> > generic setattr is helpful (makes conversions a bit easier too).
> > Did you see my patch? What do you think of that basic approach?
> 
> I was waiting for a patch series for your for this to appear, but
> noticed that you actually had a small proof of concept patch attached,
> sorry :)

No problem. I'm going to send out a patchset, I've just been working
on exploring different ideas and trying to work out bugs.


> Looking at your patch I really like that vmtruncate now really just
> does what it's name claims to - truncate the VM-information about
> the file (well, and the file size).   I'm not so happy about
> still keeping the two level setattr/truncate indirection.

In my patch series, i_size update eventually is moved out to the
filesystem too, and vmtruncate just is renamed to truncate_pagecache
(vmtruncate is not such a bad name, but rename will nicely break
unconverted modules).

 
> But instead of folding truncate into setattr I wonder if we should
> just add a new ->setsize (aka new trunacte) methodas a top-level
> entry point instead of ->setattr with ATTR_SIZE given that size
> changes don't have much in common with the reset of ->setattr.

OK that would be possible and makes sense I guess. The new truncate
which returns error could basically be renamed in-place. Shall we
continue to give ATTR_SIZE to setattr, or take that out completely?
I guess truncate can be considered special because it operates on
data not only metadata.

Looks like ->setsize would need a flag for ATTR_OPEN too? Any others?
I'll do a bit of an audit when I get around to it...

 
> The only bit shared is updating c/mtime and even that is conditional.
> So I'd say take most of your patch, but instead of doing an all at
> once migration migrate filesystems to the new ->setsize callback
> incrementally and eventually kill off the old code.  This means
> we'll need a new name for the new vmtruncate-lite but should otherwise
> be pretty easy.

Makes sense. I'll try to structure it to allow incremental changeover.

 
> > >  The only problem is the generic aops calling
> > > vmtruncate directly.
> > 
> > What should be done is require that filesystems trim blocks past
> > i_size in case of any errors. I actually need to fix up a few
> > existing bugs in this area too, so I'll look at this..
> 
> Basically we want ->setattr with ATTR_SIZE, execept that we already
> have i_sem and possibly other per-inode locks.  Take a look at
> 
> 	http://oss.sgi.com/archives/xfs/2008-04/msg00542.html
> 
> and
> 
> 	http://oss.sgi.com/archives/xfs/2009-03/msg00214.html

OK thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 02/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize
  2009-07-02  7:22           ` Nick Piggin
@ 2009-07-04 15:18             ` Christoph Hellwig
  2009-07-06  9:08               ` Nick Piggin
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2009-07-04 15:18 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Hellwig, Jan Kara, LKML, linux-mm, linux-fsdevel

On Thu, Jul 02, 2009 at 09:22:25AM +0200, Nick Piggin wrote:
> > Looking at your patch I really like that vmtruncate now really just
> > does what it's name claims to - truncate the VM-information about
> > the file (well, and the file size).   I'm not so happy about
> > still keeping the two level setattr/truncate indirection.
> 
> In my patch series, i_size update eventually is moved out to the
> filesystem too, and vmtruncate just is renamed to truncate_pagecache
> (vmtruncate is not such a bad name, but rename will nicely break
> unconverted modules).

Good, that's a much better calling and naming convention.

> > But instead of folding truncate into setattr I wonder if we should
> > just add a new ->setsize (aka new trunacte) methodas a top-level
> > entry point instead of ->setattr with ATTR_SIZE given that size
> > changes don't have much in common with the reset of ->setattr.
> 
> OK that would be possible and makes sense I guess. The new truncate
> which returns error could basically be renamed in-place. Shall we
> continue to give ATTR_SIZE to setattr, or take that out completely?
> I guess truncate can be considered special because it operates on
> data not only metadata.
> 
> Looks like ->setsize would need a flag for ATTR_OPEN too? Any others?
> I'll do a bit of an audit when I get around to it...

In the end ATTR_SIZE should not be passed to ->setattr anymore, and
->setsize should become mandatory.  For the transition I would recommend
calling ->setsize if present else fall back to the current way.  That
way we can migreate one filesystem per patch to the new scheme.

I would suggest giving the flags to ->setsize their own namespace with
two flags so far SETSIZE_FTRUNCATE (need to update the file size and
have a file struct available) and SETSIZE_OPEN for the ATTR_OPEN case.

That beeing said I reallye hate the conditiona file argument for
ftrunctate (currently hidden inside struct iattr), maybe we're better
off having am optional int (*ftruncate)(struct file *) method for those
filesystems that need it, with a fallback to ->setsize.

And yeah, maybe ->setsize might better be left as ->truncate, but if
we want a nicely bisectable migration we'd have to rename the old
truncate to e.g. ->old_truncate before.  That's probably worth having
the better naming in the end.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 02/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize
  2009-07-04 15:18             ` Christoph Hellwig
@ 2009-07-06  9:08               ` Nick Piggin
  2009-07-06 10:35                 ` Christoph Hellwig
  0 siblings, 1 reply; 39+ messages in thread
From: Nick Piggin @ 2009-07-06  9:08 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, LKML, linux-mm, linux-fsdevel

On Sat, Jul 04, 2009 at 11:18:01AM -0400, Christoph Hellwig wrote:
> On Thu, Jul 02, 2009 at 09:22:25AM +0200, Nick Piggin wrote:
> > I guess truncate can be considered special because it operates on
> > data not only metadata.
> > 
> > Looks like ->setsize would need a flag for ATTR_OPEN too? Any others?
> > I'll do a bit of an audit when I get around to it...
> 
> In the end ATTR_SIZE should not be passed to ->setattr anymore, and
> ->setsize should become mandatory.  For the transition I would recommend
> calling ->setsize if present else fall back to the current way.  That
> way we can migreate one filesystem per patch to the new scheme.
> 
> I would suggest giving the flags to ->setsize their own namespace with
> two flags so far SETSIZE_FTRUNCATE (need to update the file size and
> have a file struct available) and SETSIZE_OPEN for the ATTR_OPEN case.
> 
> That beeing said I reallye hate the conditiona file argument for
> ftrunctate (currently hidden inside struct iattr), maybe we're better
> off having am optional int (*ftruncate)(struct file *) method for those
> filesystems that need it, with a fallback to ->setsize.

OK, hmm, but I wonder -- most of the time do_truncate will need to
call notify_change anyway, so I wonder if avoiding the double
indirection saves us anything? (It requires 2 indirect calls either
way). And if we call ->setsize from ->setattr, then a filesystem
which implements its own ->setattr could avoid one of those indirect
calls. Not so if do_truncate has to call ->setattr then ->setsize.

We definitely could call the method ->ftruncate, however (regardless
of where we call it from). In fact, we could just have a single new
->ftruncate where struct file * argument is NULL if not called via
an open file. This should also solve the namimg issue without
renaming the old method (having both ->truncate and ->ftruncate
could be slightly confusing at a glance, but we will remove
->truncate ASAP).

Anyway, let me finish the first draft and post my series and we
can go over it further.


> And yeah, maybe ->setsize might better be left as ->truncate, but if
> we want a nicely bisectable migration we'd have to rename the old
> truncate to e.g. ->old_truncate before.  That's probably worth having
> the better naming in the end.

It is definitely better to not break things as my first patch has
done. I think it should not be too hard to have intermediate steps.

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 02/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize
  2009-07-06  9:08               ` Nick Piggin
@ 2009-07-06 10:35                 ` Christoph Hellwig
  2009-07-06 11:49                   ` Nick Piggin
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2009-07-06 10:35 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Hellwig, Jan Kara, LKML, linux-mm, linux-fsdevel

On Mon, Jul 06, 2009 at 11:08:04AM +0200, Nick Piggin wrote:
> OK, hmm, but I wonder -- most of the time do_truncate will need to
> call notify_change anyway, so I wonder if avoiding the double
> indirection saves us anything? (It requires 2 indirect calls either
> way). And if we call ->setsize from ->setattr, then a filesystem
> which implements its own ->setattr could avoid one of those indirect
> calls. Not so if do_truncate has to call ->setattr then ->setsize.

I don't quite understand what you mean here. In the end there should
be one single indirect call, ->setsize (or whatever it's called by
then).

In the first round we'd split up a helper just for size updates from
notify_change, ala:

int vfs_truncate(struct dentry *dentry, loff_t size, int flags, file)
{
	int error;

	error = security_inode_truncate(dentry, size, flags, file);
	if (error)
		return error;

	if (inode->i_op->setsize) {
		inode->i_op->setsize(dentry, size, flags, file);

	} else {
		<... built up iattr here ...>

		if (inode->i_op->setattr) {	
			down_write(&dentry->d_inode->i_alloc_sem);
			error = inode->i_op->setattr(dentry, attr);
			up_write(&dentry->d_inode->i_alloc_sem);
		} else {
			down_write(&dentry->d_inode->i_alloc_sem);
			error = inode_setattr(inode, attr);
			up_write(&dentry->d_inode->i_alloc_sem);
		}
	}

	if (!error)
		fsnotify_truncate(dentry, size, flags);
	return error;
}

One all filesistem are converted to have a setsize method (either their
own or simple_setsize) the !inode->i_op->setsize case can go away.

Note that the above variant moves taking i_alloc_sem into ->setsize as
it's not required for most filesystems (I think only extN need for
O_DIRECT).

Also the above doesn't deal with killing the SUID/SGID bits yet, we'll
need some good way for that.

Actually it might be better to just pass the iattr to ->setsize to so
we can have the parsing for those arguments once, and that filesystems
can re-use parts of their ->setattr for ->setsize if it's complex enough
(timestamp updates and suid/sgid killing)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 02/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize
  2009-07-06 10:35                 ` Christoph Hellwig
@ 2009-07-06 11:49                   ` Nick Piggin
  0 siblings, 0 replies; 39+ messages in thread
From: Nick Piggin @ 2009-07-06 11:49 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, LKML, linux-mm, linux-fsdevel

On Mon, Jul 06, 2009 at 06:35:40AM -0400, Christoph Hellwig wrote:
> On Mon, Jul 06, 2009 at 11:08:04AM +0200, Nick Piggin wrote:
> > OK, hmm, but I wonder -- most of the time do_truncate will need to
> > call notify_change anyway, so I wonder if avoiding the double
> > indirection saves us anything? (It requires 2 indirect calls either
> > way). And if we call ->setsize from ->setattr, then a filesystem
> > which implements its own ->setattr could avoid one of those indirect
> > calls. Not so if do_truncate has to call ->setattr then ->setsize.
> 
> I don't quite understand what you mean here. In the end there should
> be one single indirect call, ->setsize (or whatever it's called by
> then).
> 
> In the first round we'd split up a helper just for size updates from
> notify_change, ala:
> 
> int vfs_truncate(struct dentry *dentry, loff_t size, int flags, file)
> {
> 	int error;
> 
> 	error = security_inode_truncate(dentry, size, flags, file);
> 	if (error)
> 		return error;
> 
> 	if (inode->i_op->setsize) {
> 		inode->i_op->setsize(dentry, size, flags, file);
> 
> 	} else {
> 		<... built up iattr here ...>
> 
> 		if (inode->i_op->setattr) {	
> 			down_write(&dentry->d_inode->i_alloc_sem);
> 			error = inode->i_op->setattr(dentry, attr);
> 			up_write(&dentry->d_inode->i_alloc_sem);
> 		} else {
> 			down_write(&dentry->d_inode->i_alloc_sem);
> 			error = inode_setattr(inode, attr);
> 			up_write(&dentry->d_inode->i_alloc_sem);
> 		}
> 	}
> 
> 	if (!error)
> 		fsnotify_truncate(dentry, size, flags);
> 	return error;
> }
> 
> One all filesistem are converted to have a setsize method (either their
> own or simple_setsize) the !inode->i_op->setsize case can go away.
> 
> Note that the above variant moves taking i_alloc_sem into ->setsize as
> it's not required for most filesystems (I think only extN need for
> O_DIRECT).
> 
> Also the above doesn't deal with killing the SUID/SGID bits yet, we'll
> need some good way for that.
> 
> Actually it might be better to just pass the iattr to ->setsize to so
> we can have the parsing for those arguments once, and that filesystems
> can re-use parts of their ->setattr for ->setsize if it's complex enough
> (timestamp updates and suid/sgid killing)

^^^^
Yes this was the problem I was thinking about. Because for exampe
the truncate setattr call is also used for timestamp update as well
as should_remove_setuid. The alternative to both ->setsize and ->setattr
calls here is to reuse some of the ->setattr functionality in ->setsize
as you say.

So it will be a simpler change to call the new ->setsize from
inside ->setattr. I guess that doesn't avoid your i_alloc_sem
probem, but maybe we should move that into implementations if only
a few filesystems require it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2009-07-06 11:49 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-06-15 17:59 [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3) Jan Kara
2009-06-15 17:59 ` [PATCH 01/11] ext3: Get rid of extenddisksize parameter of ext3_get_blocks_handle() Jan Kara
2009-06-17 10:28   ` Nick Piggin
2009-06-17 11:49     ` Jan Kara
2009-06-15 17:59 ` [PATCH 02/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize Jan Kara
2009-06-25 16:17   ` Nick Piggin
2009-06-25 16:43     ` Nick Piggin
2009-06-25 17:47     ` Christoph Hellwig
2009-06-26  8:42       ` Nick Piggin
2009-06-30 17:37         ` Christoph Hellwig
2009-07-02  7:22           ` Nick Piggin
2009-07-04 15:18             ` Christoph Hellwig
2009-07-06  9:08               ` Nick Piggin
2009-07-06 10:35                 ` Christoph Hellwig
2009-07-06 11:49                   ` Nick Piggin
2009-06-26 12:21     ` Jan Kara
2009-06-26 12:55       ` Nick Piggin
2009-06-26 16:08         ` Jan Kara
2009-06-29  5:54           ` Nick Piggin
2009-06-15 17:59 ` [PATCH 03/11] ext2: Allocate space for mmaped file on page fault Jan Kara
2009-06-15 17:59 ` [PATCH 04/11] ext4: Make sure blocks are properly allocated under mmaped page even when blocksize < pagesize Jan Kara
2009-06-15 17:59 ` [PATCH 05/11] ext3: Allocate space for mmaped file on page fault Jan Kara
2009-06-15 17:59 ` [PATCH 06/11] vfs: Implement generic per-cpu counters for delayed allocation Jan Kara
2009-06-15 17:59 ` [PATCH 07/11] vfs: Unmap underlying metadata of new data buffers only when buffer is mapped Jan Kara
2009-06-17 10:35   ` Nick Piggin
2009-06-17 12:05     ` Jan Kara
2009-06-17 13:53       ` Nick Piggin
2009-06-18 12:00         ` Theodore Tso
2009-06-18 11:51   ` OGAWA Hirofumi
2009-06-15 17:59 ` [PATCH 08/11] fs: Don't clear dirty bits in block_write_full_page() Jan Kara
2009-06-15 17:59 ` [PATCH 09/11] vfs: Export wakeup_pdflush Jan Kara
2009-06-15 17:59 ` [PATCH 10/11] ext3: Implement delayed allocation on page_mkwrite time Jan Kara
2009-06-15 18:02 ` [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3) Jan Kara
2009-06-15 18:17 ` Aneesh Kumar K.V
2009-06-16 10:28   ` Jan Kara
2009-06-16 14:34 ` Christoph Hellwig
2009-06-16 14:42   ` Jan Kara
2009-06-30 17:44     ` Christoph Hellwig
2009-07-01 10:29       ` Aneesh Kumar K.V

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).