* [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support
@ 2007-02-27 23:14 Miklos Szeredi
  2007-02-27 23:14 ` [patch 01/22] update ctime and mtime for mmaped write Miklos Szeredi
                   ` (21 more replies)
  0 siblings, 22 replies; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-27 23:14 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

The first part of this series (1-7) contains miscellaneous patches,
some of which are needed for fuse writable mmap to work correctly.

Some of these are resends of patches already in -mm, with minor
updates.

The rest of the series adds shared writable mapping support to fuse,
with some write performance improvements thrown in.  Survives
fsx-linux, LTP and bash-shared-mappings.

There are dependencies on patches in -mm, notably

  per-backing_dev-dirty-and-writeback-page-accounting.patch

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [patch 01/22] update ctime and mtime for mmaped write
  2007-02-27 23:14 [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support Miklos Szeredi
@ 2007-02-27 23:14 ` Miklos Szeredi
  2007-02-28 14:16   ` Peter Staubach
  2007-02-27 23:14 ` [patch 02/22] fix quadratic behavior of shrink_dcache_parent() Miklos Szeredi
                   ` (20 subsequent siblings)
  21 siblings, 1 reply; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-27 23:14 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: mmap_mtime.patch --]
[-- Type: text/plain, Size: 10935 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Changes:
 o moved check from __fput() to remove_vma(), which is more logical
 o changed set_page_dirty() to set_page_dirty_mapping() in hugetlb.c
 o cleaned up #ifdef CONFIG_BLOCK mess


This patch makes writing to shared memory mappings update st_ctime and
st_mtime as defined by SUSv3:

   The st_ctime and st_mtime fields of a file that is mapped with
   MAP_SHARED and PROT_WRITE shall be marked for update at some point
   in the interval between a write reference to the mapped region and
   the next call to msync() with MS_ASYNC or MS_SYNC for that portion
   of the file by any process. If there is no such call and if the
   underlying file is modified as a result of a write reference, then
   these fields shall be marked for update at some time after the
   write reference.

A new address_space flag is introduced: AS_CMTIME.  This is set each
time a page is dirtied through a userspace memory mapping.  This
includes write accesses via get_user_pages().

Note, the flag is set unconditionally, even if the page is already
dirty.  This is important, because the page might have been dirtied
earlier by a non-mmap write.

This flag is checked in msync() and munmap()/mremap(), and if set, the
file times are updated and the flag is cleared.

The flag is also cleared if the time update is triggered by a normal
write.  This is not mandated by the standard, but seems to be a sane
thing to do.

Fixes Novell Bugzilla #206431.

Inspired by Peter Staubach's patch and the resulting comments.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/include/linux/pagemap.h
===================================================================
--- linux.orig/include/linux/pagemap.h	2007-02-27 14:41:06.000000000 +0100
+++ linux/include/linux/pagemap.h	2007-02-27 14:41:07.000000000 +0100
@@ -18,6 +18,7 @@
  */
 #define	AS_EIO		(__GFP_BITS_SHIFT + 0)	/* IO error on async write */
 #define AS_ENOSPC	(__GFP_BITS_SHIFT + 1)	/* ENOSPC on async write */
+#define AS_CMTIME	(__GFP_BITS_SHIFT + 2)	/* ctime/mtime update needed */
 
 static inline void mapping_set_error(struct address_space * mapping, int error)
 {
Index: linux/mm/msync.c
===================================================================
--- linux.orig/mm/msync.c	2007-02-27 14:40:56.000000000 +0100
+++ linux/mm/msync.c	2007-02-27 14:41:07.000000000 +0100
@@ -77,6 +77,7 @@ asmlinkage long sys_msync(unsigned long 
 		}
 		file = vma->vm_file;
 		start = vma->vm_end;
+		mapping_update_time(file);
 		if ((flags & MS_SYNC) && file &&
 				(vma->vm_flags & VM_SHARED)) {
 			get_file(file);
Index: linux/fs/inode.c
===================================================================
--- linux.orig/fs/inode.c	2007-02-27 14:40:56.000000000 +0100
+++ linux/fs/inode.c	2007-02-27 14:41:07.000000000 +0100
@@ -1218,6 +1218,7 @@ void file_update_time(struct file *file)
 	struct timespec now;
 	int sync_it = 0;
 
+	clear_bit(AS_CMTIME, &file->f_mapping->flags);
 	if (IS_NOCMTIME(inode))
 		return;
 	if (IS_RDONLY(inode))
@@ -1240,6 +1241,12 @@ void file_update_time(struct file *file)
 
 EXPORT_SYMBOL(file_update_time);
 
+void mapping_update_time(struct file *file)
+{
+	if (test_bit(AS_CMTIME, &file->f_mapping->flags))
+		file_update_time(file);
+}
+
 int inode_needs_sync(struct inode *inode)
 {
 	if (IS_SYNC(inode))
Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h	2007-02-27 14:40:56.000000000 +0100
+++ linux/include/linux/fs.h	2007-02-27 14:41:07.000000000 +0100
@@ -1893,6 +1893,7 @@ extern int inode_change_ok(struct inode 
 extern int __must_check inode_setattr(struct inode *, struct iattr *);
 
 extern void file_update_time(struct file *file);
+extern void mapping_update_time(struct file *file);
 
 static inline ino_t parent_ino(struct dentry *dentry)
 {
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h	2007-02-27 14:40:56.000000000 +0100
+++ linux/include/linux/mm.h	2007-02-27 14:41:07.000000000 +0100
@@ -791,6 +791,7 @@ int redirty_page_for_writepage(struct wr
 				struct page *page);
 int FASTCALL(set_page_dirty(struct page *page));
 int set_page_dirty_lock(struct page *page);
+int set_page_dirty_mapping(struct page *page);
 int clear_page_dirty_for_io(struct page *page);
 
 extern unsigned long do_mremap(unsigned long addr,
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c	2007-02-27 14:40:56.000000000 +0100
+++ linux/mm/memory.c	2007-02-27 14:41:07.000000000 +0100
@@ -676,7 +676,7 @@ static unsigned long zap_pte_range(struc
 				anon_rss--;
 			else {
 				if (pte_dirty(ptent))
-					set_page_dirty(page);
+					set_page_dirty_mapping(page);
 				if (pte_young(ptent))
 					SetPageReferenced(page);
 				file_rss--;
@@ -954,7 +954,7 @@ struct page *follow_page(struct vm_area_
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
-			set_page_dirty(page);
+			set_page_dirty_mapping(page);
 		mark_page_accessed(page);
 	}
 unlock:
@@ -1514,6 +1514,15 @@ static inline void cow_user_page(struct 
 	copy_user_highpage(dst, src, va, vma);
 }
 
+static void set_page_dirty_mapping_balance(struct page *page)
+{
+	if (set_page_dirty_mapping(page)) {
+		struct address_space *mapping = page_mapping(page);
+		if (mapping)
+			balance_dirty_pages_ratelimited(mapping);
+	}
+}
+
 /*
  * This routine handles present pages, when users try to write
  * to a shared page. It is done by copying the page to a new address
@@ -1664,7 +1673,7 @@ gotten:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (dirty_page) {
-		set_page_dirty_balance(dirty_page);
+		set_page_dirty_mapping_balance(dirty_page);
 		put_page(dirty_page);
 	}
 	return ret;
@@ -2316,7 +2325,7 @@ retry:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (dirty_page) {
-		set_page_dirty_balance(dirty_page);
+		set_page_dirty_mapping_balance(dirty_page);
 		put_page(dirty_page);
 	}
 	return ret;
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c	2007-02-27 14:41:06.000000000 +0100
+++ linux/mm/page-writeback.c	2007-02-27 14:41:07.000000000 +0100
@@ -244,16 +244,6 @@ static void balance_dirty_pages(struct a
 		pdflush_operation(background_writeout, 0);
 }
 
-void set_page_dirty_balance(struct page *page)
-{
-	if (set_page_dirty(page)) {
-		struct address_space *mapping = page_mapping(page);
-
-		if (mapping)
-			balance_dirty_pages_ratelimited(mapping);
-	}
-}
-
 /**
  * balance_dirty_pages_ratelimited_nr - balance dirty memory state
  * @mapping: address_space which was dirtied
@@ -798,6 +788,10 @@ int redirty_page_for_writepage(struct wr
 }
 EXPORT_SYMBOL(redirty_page_for_writepage);
 
+#ifndef CONFIG_BLOCK
+#define __set_page_dirty_buffers NULL
+#endif
+
 /*
  * If the mapping doesn't provide a set_page_dirty a_op, then
  * just fall through and assume that it wants buffer_heads.
@@ -808,10 +802,8 @@ int fastcall set_page_dirty(struct page 
 
 	if (likely(mapping)) {
 		int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
-#ifdef CONFIG_BLOCK
 		if (!spd)
 			spd = __set_page_dirty_buffers;
-#endif
 		return (*spd)(page);
 	}
 	if (!PageDirty(page)) {
@@ -823,6 +815,28 @@ int fastcall set_page_dirty(struct page 
 EXPORT_SYMBOL(set_page_dirty);
 
 /*
+ * Special set_page_dirty() variant for dirtiness coming from a memory
+ * mapping.  In this case the ctime/mtime update flag needs to be set.
+ */
+int set_page_dirty_mapping(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+
+	if (likely(mapping)) {
+		int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
+		if (!spd)
+			spd = __set_page_dirty_buffers;
+		set_bit(AS_CMTIME, &mapping->flags);
+		return (*spd)(page);
+	}
+	if (!PageDirty(page)) {
+		if (!TestSetPageDirty(page))
+			return 1;
+	}
+	return 0;
+}
+
+/*
  * set_page_dirty() is racy if the caller has no reference against
  * page->mapping->host, and if the page is unlocked.  This is because another
  * CPU could truncate the page off the mapping and then free the mapping.
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c	2007-02-27 14:40:56.000000000 +0100
+++ linux/mm/rmap.c	2007-02-27 14:41:07.000000000 +0100
@@ -598,7 +598,7 @@ void page_remove_rmap(struct page *page,
 		 * faster for those pages still in swapcache.
 		 */
 		if (page_test_and_clear_dirty(page))
-			set_page_dirty(page);
+			set_page_dirty_mapping(page);
 		__dec_zone_page_state(page,
 				PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
 	}
@@ -643,7 +643,7 @@ static int try_to_unmap_one(struct page 
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
 	if (pte_dirty(pteval))
-		set_page_dirty(page);
+		set_page_dirty_mapping(page);
 
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
@@ -777,7 +777,7 @@ static void try_to_unmap_cluster(unsigne
 
 		/* Move the dirty bit to the physical page now the pte is gone. */
 		if (pte_dirty(pteval))
-			set_page_dirty(page);
+			set_page_dirty_mapping(page);
 
 		page_remove_rmap(page, vma);
 		page_cache_release(page);
Index: linux/include/linux/writeback.h
===================================================================
--- linux.orig/include/linux/writeback.h	2007-02-27 14:40:56.000000000 +0100
+++ linux/include/linux/writeback.h	2007-02-27 14:41:07.000000000 +0100
@@ -117,7 +117,6 @@ int sync_page_range(struct inode *inode,
 			loff_t pos, loff_t count);
 int sync_page_range_nolock(struct inode *inode, struct address_space *mapping,
 			   loff_t pos, loff_t count);
-void set_page_dirty_balance(struct page *page);
 void writeback_set_ratelimit(void);
 
 /* pdflush.c */
Index: linux/mm/mmap.c
===================================================================
--- linux.orig/mm/mmap.c	2007-02-27 14:40:56.000000000 +0100
+++ linux/mm/mmap.c	2007-02-27 14:41:07.000000000 +0100
@@ -226,8 +226,10 @@ static struct vm_area_struct *remove_vma
 	might_sleep();
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
-	if (vma->vm_file)
+	if (vma->vm_file) {
+		mapping_update_time(vma->vm_file);
 		fput(vma->vm_file);
+	}
 	mpol_free(vma_policy(vma));
 	kmem_cache_free(vm_area_cachep, vma);
 	return next;
Index: linux/mm/hugetlb.c
===================================================================
--- linux.orig/mm/hugetlb.c	2007-02-27 14:40:56.000000000 +0100
+++ linux/mm/hugetlb.c	2007-02-27 14:41:07.000000000 +0100
@@ -390,7 +390,7 @@ void __unmap_hugepage_range(struct vm_ar
 
 		page = pte_page(pte);
 		if (pte_dirty(pte))
-			set_page_dirty(page);
+			set_page_dirty_mapping(page);
 		list_add(&page->lru, &page_list);
 	}
 	spin_unlock(&mm->page_table_lock);

--


* [patch 02/22] fix quadratic behavior of shrink_dcache_parent()
  2007-02-27 23:14 [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support Miklos Szeredi
  2007-02-27 23:14 ` [patch 01/22] update ctime and mtime for mmaped write Miklos Szeredi
@ 2007-02-27 23:14 ` Miklos Szeredi
  2007-02-27 23:14 ` [patch 03/22] fix deadlock in balance_dirty_pages Miklos Szeredi
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-27 23:14 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: fix_shrink_quadratic.patch --]
[-- Type: text/plain, Size: 6285 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Changes:
 o dput already checks dentry == NULL, so remove check
   from prune_one_dentry()


The time shrink_dcache_parent() takes grows quadratically with the
depth of the tree under 'parent'.  This starts to get noticeable at a
depth of about 10,000.

These kinds of depths don't occur normally, and filesystems which
invoke shrink_dcache_parent() via d_invalidate() seem to have other
depth-dependent timings, so it's not even easy to expose this problem.

However with FUSE it's easy to create a deep tree and d_invalidate()
will also get called.  This can make a syscall hang for a very long
time.

This is the original discovery of the problem by Russ Cox:

  http://article.gmane.org/gmane.comp.file-systems.fuse.devel/3826

The following patch fixes the quadratic behavior, by optionally
allowing prune_dcache() to prune ancestors of a dentry in one go,
instead of doing it one at a time.

Common code in dput() and prune_one_dentry() is extracted into a new
helper function d_kill().

shrink_dcache_parent() as well as shrink_dcache_sb() are converted to
use the ancestry-pruner option.  Only for shrink_dcache_memory() is
this behavior not desirable, so it keeps using the old algorithm.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/fs/dcache.c
===================================================================
--- linux.orig/fs/dcache.c	2007-02-27 14:40:56.000000000 +0100
+++ linux/fs/dcache.c	2007-02-27 14:41:07.000000000 +0100
@@ -121,6 +121,28 @@ static void dentry_iput(struct dentry * 
 	}
 }
 
+/**
+ * d_kill - kill dentry and return parent
+ * @dentry: dentry to kill
+ *
+ * Called with dcache_lock and d_lock, releases both.  The dentry must
+ * already be unhashed and removed from the LRU.
+ *
+ * If this is the root of the dentry tree, return NULL.
+ */
+static struct dentry *d_kill(struct dentry *dentry)
+{
+	struct dentry *parent;
+
+	list_del(&dentry->d_u.d_child);
+	dentry_stat.nr_dentry--;	/* For d_free, below */
+	/*drops the locks, at that point nobody can reach this dentry */
+	dentry_iput(dentry);
+	parent = dentry->d_parent;
+	d_free(dentry);
+	return dentry == parent ? NULL : parent;
+}
+
 /* 
  * This is dput
  *
@@ -189,28 +211,17 @@ repeat:
 
 unhash_it:
 	__d_drop(dentry);
-
-kill_it: {
-		struct dentry *parent;
-
-		/* If dentry was on d_lru list
-		 * delete it from there
-		 */
-  		if (!list_empty(&dentry->d_lru)) {
-  			list_del(&dentry->d_lru);
-  			dentry_stat.nr_unused--;
-  		}
-  		list_del(&dentry->d_u.d_child);
-		dentry_stat.nr_dentry--;	/* For d_free, below */
-		/*drops the locks, at that point nobody can reach this dentry */
-		dentry_iput(dentry);
-		parent = dentry->d_parent;
-		d_free(dentry);
-		if (dentry == parent)
-			return;
-		dentry = parent;
-		goto repeat;
+kill_it:
+	/* If dentry was on d_lru list
+	 * delete it from there
+	 */
+	if (!list_empty(&dentry->d_lru)) {
+		list_del(&dentry->d_lru);
+		dentry_stat.nr_unused--;
 	}
+	dentry = d_kill(dentry);
+	if (dentry)
+		goto repeat;
 }
 
 /**
@@ -371,22 +382,40 @@ restart:
  * Throw away a dentry - free the inode, dput the parent.  This requires that
  * the LRU list has already been removed.
  *
+ * If prune_parents is true, try to prune ancestors as well.
+ *
  * Called with dcache_lock, drops it and then regains.
  * Called with dentry->d_lock held, drops it.
  */
-static void prune_one_dentry(struct dentry * dentry)
+static void prune_one_dentry(struct dentry * dentry, int prune_parents)
 {
-	struct dentry * parent;
-
 	__d_drop(dentry);
-	list_del(&dentry->d_u.d_child);
-	dentry_stat.nr_dentry--;	/* For d_free, below */
-	dentry_iput(dentry);
-	parent = dentry->d_parent;
-	d_free(dentry);
-	if (parent != dentry)
-		dput(parent);
+	dentry = d_kill(dentry);
+	if (!prune_parents) {
+		dput(dentry);
+		spin_lock(&dcache_lock);
+		return;
+	}
+
+	/*
+	 * Prune ancestors.  Locking is simpler than in dput(),
+	 * because dcache_lock needs to be taken anyway.
+	 */
 	spin_lock(&dcache_lock);
+	while (dentry) {
+		if (!atomic_dec_and_lock(&dentry->d_count, &dentry->d_lock))
+			return;
+
+		if (dentry->d_op && dentry->d_op->d_delete)
+			dentry->d_op->d_delete(dentry);
+		if (!list_empty(&dentry->d_lru)) {
+			list_del(&dentry->d_lru);
+			dentry_stat.nr_unused--;
+		}
+		__d_drop(dentry);
+		dentry = d_kill(dentry);
+		spin_lock(&dcache_lock);
+	}
 }
 
 /**
@@ -394,6 +423,7 @@ static void prune_one_dentry(struct dent
  * @count: number of entries to try and free
  * @sb: if given, ignore dentries for other superblocks
  *         which are being unmounted.
+ * @prune_parents: if true, try to prune ancestors as well in one go
  *
  * Shrink the dcache. This is done when we need
  * more memory, or simply when we need to unmount
@@ -404,7 +434,7 @@ static void prune_one_dentry(struct dent
  * all the dentries are in use.
  */
  
-static void prune_dcache(int count, struct super_block *sb)
+static void prune_dcache(int count, struct super_block *sb, int prune_parents)
 {
 	spin_lock(&dcache_lock);
 	for (; count ; count--) {
@@ -464,7 +494,7 @@ static void prune_dcache(int count, stru
 		 * without taking the s_umount lock (I already hold it).
 		 */
 		if (sb && dentry->d_sb == sb) {
-			prune_one_dentry(dentry);
+			prune_one_dentry(dentry, prune_parents);
 			continue;
 		}
 		/*
@@ -479,7 +509,7 @@ static void prune_dcache(int count, stru
 		s_umount = &dentry->d_sb->s_umount;
 		if (down_read_trylock(s_umount)) {
 			if (dentry->d_sb->s_root != NULL) {
-				prune_one_dentry(dentry);
+				prune_one_dentry(dentry, prune_parents);
 				up_read(s_umount);
 				continue;
 			}
@@ -550,7 +580,7 @@ repeat:
 			spin_unlock(&dentry->d_lock);
 			continue;
 		}
-		prune_one_dentry(dentry);
+		prune_one_dentry(dentry, 1);
 		cond_resched_lock(&dcache_lock);
 		goto repeat;
 	}
@@ -829,7 +859,7 @@ void shrink_dcache_parent(struct dentry 
 	int found;
 
 	while ((found = select_parent(parent)) != 0)
-		prune_dcache(found, parent->d_sb);
+		prune_dcache(found, parent->d_sb, 1);
 }
 
 /*
@@ -849,7 +879,7 @@ static int shrink_dcache_memory(int nr, 
 	if (nr) {
 		if (!(gfp_mask & __GFP_FS))
 			return -1;
-		prune_dcache(nr, NULL);
+		prune_dcache(nr, NULL, 0);
 	}
 	return (dentry_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
 }

--


* [patch 03/22] fix deadlock in balance_dirty_pages
  2007-02-27 23:14 [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support Miklos Szeredi
  2007-02-27 23:14 ` [patch 01/22] update ctime and mtime for mmaped write Miklos Szeredi
  2007-02-27 23:14 ` [patch 02/22] fix quadratic behavior of shrink_dcache_parent() Miklos Szeredi
@ 2007-02-27 23:14 ` Miklos Szeredi
  2007-02-27 23:14 ` [patch 04/22] fix deadlock in throttle_vm_writeout Miklos Szeredi
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-27 23:14 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: dirty_balancing_fix.patch --]
[-- Type: text/plain, Size: 3829 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

This deadlock happens when dirty pages from one filesystem are
written back through another filesystem.  It is easiest to demonstrate
with fuse, although it could affect loopback mounts as well (see
following patches).

Let's call the filesystems A(bove) and B(elow).  Process Pr_a is
writing to A, and process Pr_b is writing to B.

Pr_a is bash-shared-mapping.  Pr_b is the fuse filesystem daemon
(fusexmp_fh), for simplicity let's assume that Pr_b is single
threaded.

These are the simplified stack traces of these processes after the
deadlock:

Pr_a (bash-shared-mapping):

  (block on queue)
  fuse_writepage
  generic_writepages
  writeback_inodes
  balance_dirty_pages
  balance_dirty_pages_ratelimited_nr
  set_page_dirty_mapping_balance
  do_no_page


Pr_b (fusexmp_fh):

  io_schedule_timeout
  congestion_wait
  balance_dirty_pages
  balance_dirty_pages_ratelimited_nr
  generic_file_buffered_write
  generic_file_aio_write
  ext3_file_write
  do_sync_write
  vfs_write
  sys_pwrite64


Thanks to the aggressive nature of Pr_a, it can happen that

  nr_file_dirty > dirty_thresh + margin

This is due to both nr_dirty growing and dirty_thresh shrinking, which
in turn is due to nr_file_mapped rapidly growing.  The exact size of
the margin at which the deadlock happens is not known, but it's around
100 pages.

At this point Pr_a enters balance_dirty_pages and starts to write back
some of its dirty pages.  After submitting some requests, it blocks
on the request queue.

The first write request will trigger Pr_b to perform a write()
syscall.  This will submit a write request to the block device and
then may enter balance_dirty_pages().

The condition for exiting balance_dirty_pages() is

 - either that write_chunk pages have been written

 - or nr_file_dirty + nr_writeback < dirty_thresh

It is entirely possible that fewer than write_chunk pages were
written, in which case balance_dirty_pages() will not exit even after
all the submitted requests have been successfully completed.

Which means that the write() syscall does not return.

Which means that no more dirty pages from A will be written back, and
neither nr_writeback nor nr_file_dirty will decrease.

Which means that balance_dirty_pages() will loop forever.

Q.E.D.

The solution is to exit balance_dirty_pages() on the condition that
there are only a few dirty + writeback pages for this backing dev.
This makes sure that there is always some progress with this setup.

The number of outstanding dirty + writeback pages is limited to 8,
which means that when over the threshold (dirty_exceeded == 1), each
filesystem may only effectively pin a maximum of 16 (+8 because of
ratelimiting) extra pages.

Note: a similar safety valve is always needed if there's a global
limit for the dirty + writeback pages, even if in the future there
will be some per-queue (or other) soft limit.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c	2007-02-27 14:41:07.000000000 +0100
+++ linux/mm/page-writeback.c	2007-02-27 14:41:07.000000000 +0100
@@ -201,6 +201,17 @@ static void balance_dirty_pages(struct a
 		if (!dirty_exceeded)
 			dirty_exceeded = 1;
 
+		/*
+		 * Acquit producer of dirty pages if there's little or
+		 * nothing to write back to this particular queue.
+		 *
+		 * Without this check a deadlock is possible for if
+		 * one filesystem is writing data through another.
+		 */
+		if (atomic_long_read(&bdi->nr_dirty) +
+		    atomic_long_read(&bdi->nr_writeback) < 8)
+			break;
+
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
 		 * filesystems (i.e. NFS) in which data may have been

--


* [patch 04/22] fix deadlock in throttle_vm_writeout
  2007-02-27 23:14 [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support Miklos Szeredi
                   ` (2 preceding siblings ...)
  2007-02-27 23:14 ` [patch 03/22] fix deadlock in balance_dirty_pages Miklos Szeredi
@ 2007-02-27 23:14 ` Miklos Szeredi
  2007-02-27 23:14 ` [patch 05/22] balance dirty pages from loop device Miklos Szeredi
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-27 23:14 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: throttle_vm_writeout_fix.patch --]
[-- Type: text/plain, Size: 4480 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

This deadlock is similar to the one in balance_dirty_pages, but
instead of waiting in balance_dirty_pages after submitting a write
request, it happens during a memory allocation for filesystem B before
submitting a write request.

It is easy to reproduce on a machine with not too much memory.
E.g. try this on 2.6.21-rc1 UML with 32MB (works on physical hw as
well):

  dd if=/dev/zero of=/tmp/tmp.img bs=1048576 count=40
  mke2fs -j -F /tmp/tmp.img
  mkdir /tmp/img
  mount -oloop /tmp/tmp.img /tmp/img
  bash-shared-mapping /tmp/img/foo 30000000

The deadlock doesn't happen immediately; sometimes it takes a few
minutes to trigger.

Simplified stack trace for bash-shared-mapping after the deadlock:

  io_schedule_timeout
  congestion_wait
  balance_dirty_pages
  balance_dirty_pages_ratelimited_nr
  generic_file_buffered_write
  __generic_file_aio_write_nolock
  generic_file_aio_write
  ext3_file_write
  do_sync_write
  vfs_write
  sys_pwrite64

and for [loop0]:

  io_schedule_timeout
  congestion_wait
  throttle_vm_writeout
  shrink_zone
  shrink_zones
  try_to_free_pages
  __alloc_pages
  find_or_create_page
  do_lo_send_aops
  lo_send
  do_bio_filebacked
  loop_thread

The requirement for the deadlock is that

  nr_writeback > dirty_thresh * 1.1 + margin

Again, the margin seems to be in the 100-page range.

The task of throttle_vm_writeout is to limit the rate at which
under-writeback pages are created due to swapping.  There's no other
way direct reclaim can increase nr_writeback + nr_file_dirty.

So when there are few or no pages under swap writeback, it is safe for
this function to return.  This ensures that there's progress with
writing back dirty pages.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/include/linux/swap.h
===================================================================
--- linux.orig/include/linux/swap.h	2007-02-27 14:40:55.000000000 +0100
+++ linux/include/linux/swap.h	2007-02-27 14:41:08.000000000 +0100
@@ -279,10 +279,14 @@ static inline void disable_swap_token(vo
 	put_swap_token(swap_token_mm);
 }
 
+#define nr_swap_writeback \
+	atomic_long_read(&swapper_space.backing_dev_info->nr_writeback)
+
 #else /* CONFIG_SWAP */
 
 #define total_swap_pages			0
 #define total_swapcache_pages			0UL
+#define nr_swap_writeback			0UL
 
 #define si_swapinfo(val) \
 	do { (val)->freeswap = (val)->totalswap = 0; } while (0)
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c	2007-02-27 14:41:07.000000000 +0100
+++ linux/mm/page-writeback.c	2007-02-27 14:41:08.000000000 +0100
@@ -33,6 +33,7 @@
 #include <linux/syscalls.h>
 #include <linux/buffer_head.h>
 #include <linux/pagevec.h>
+#include <linux/swap.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -303,6 +304,21 @@ void throttle_vm_writeout(void)
 	long dirty_thresh;
 
         for ( ; ; ) {
+		/*
+		 * If there's no swapping going on, don't throttle.
+		 *
+		 * Starting writeback against mapped pages shouldn't
+		 * be a problem, as that doesn't increase the
+		 * sum of dirty + writeback.
+		 *
+		 * Without this, a deadlock is possible (also see
+		 * comment in balance_dirty_pages).  This has been
+		 * observed with running bash-shared-mapping on a
+		 * loopback mount.
+		 */
+		if (nr_swap_writeback < 16)
+			break;
+
 		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
 
                 /*
@@ -314,6 +330,7 @@ void throttle_vm_writeout(void)
                 if (global_page_state(NR_UNSTABLE_NFS) +
 			global_page_state(NR_WRITEBACK) <= dirty_thresh)
                         	break;
+
                 congestion_wait(WRITE, HZ/10);
         }
 }
Index: linux/mm/page_io.c
===================================================================
--- linux.orig/mm/page_io.c	2007-02-27 14:40:55.000000000 +0100
+++ linux/mm/page_io.c	2007-02-27 14:41:08.000000000 +0100
@@ -70,6 +70,7 @@ static int end_swap_bio_write(struct bio
 		ClearPageReclaim(page);
 	}
 	end_page_writeback(page);
+	atomic_long_dec(&swapper_space.backing_dev_info->nr_writeback);
 	bio_put(bio);
 	return 0;
 }
@@ -121,6 +122,7 @@ int swap_writepage(struct page *page, st
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		rw |= (1 << BIO_RW_SYNC);
 	count_vm_event(PSWPOUT);
+	atomic_long_inc(&swapper_space.backing_dev_info->nr_writeback);
 	set_page_writeback(page);
 	unlock_page(page);
 	submit_bio(rw, bio);

--


* [patch 05/22] balance dirty pages from loop device
  2007-02-27 23:14 [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support Miklos Szeredi
                   ` (3 preceding siblings ...)
  2007-02-27 23:14 ` [patch 04/22] fix deadlock in throttle_vm_writeout Miklos Szeredi
@ 2007-02-27 23:14 ` Miklos Szeredi
  2007-02-27 23:14 ` [patch 06/22] consolidate generic_writepages and mpage_writepages Miklos Szeredi
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-27 23:14 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: loopback_dirty_fix.patch --]
[-- Type: text/plain, Size: 981 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

The function do_lo_send_aops() should call
balance_dirty_pages_ratelimited() after each page similarly to
generic_file_buffered_write().

Without this, writing to the loop device directly (not through a
filesystem) is very slow, and it also slows the whole system down,
because nr_dirty is constantly over the limit.

Beware: this patch without the fix to balance_dirty_pages() makes a
loopback mounted filesystem prone to deadlock.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/drivers/block/loop.c
===================================================================
--- linux.orig/drivers/block/loop.c	2007-02-27 14:40:55.000000000 +0100
+++ linux/drivers/block/loop.c	2007-02-27 14:41:08.000000000 +0100
@@ -275,6 +275,8 @@ static int do_lo_send_aops(struct loop_d
 		pos += size;
 		unlock_page(page);
 		page_cache_release(page);
+		balance_dirty_pages_ratelimited(mapping);
+		cond_resched();
 	}
 	ret = 0;
 out:

--


* [patch 06/22] consolidate generic_writepages and mpage_writepages
  2007-02-27 23:14 [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support Miklos Szeredi
                   ` (4 preceding siblings ...)
  2007-02-27 23:14 ` [patch 05/22] balance dirty pages from loop device Miklos Szeredi
@ 2007-02-27 23:14 ` Miklos Szeredi
  2007-02-27 23:14 ` [patch 07/22] add filesystem subtype support Miklos Szeredi
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-27 23:14 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: writepages_consolidate.patch --]
[-- Type: text/plain, Size: 12592 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Changes:
 o fix theoretical NULL pointer dereference in __mpage_writepage
 o merge Andrew Morton's cleanups


Clean up code duplication between mpage_writepages() and
generic_writepages().

The new generic function, write_cache_pages() takes a function pointer
argument, which will be called for each page to be written.

Maybe cifs_writepages() too can use this infrastructure, but I'm not
touching that with a ten-foot pole.

The upcoming page writeback support in fuse will also want this.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/fs/mpage.c
===================================================================
--- linux.orig/fs/mpage.c	2007-02-27 14:41:06.000000000 +0100
+++ linux/fs/mpage.c	2007-02-27 14:41:08.000000000 +0100
@@ -456,11 +456,18 @@ EXPORT_SYMBOL(mpage_readpage);
  * written, so it can intelligently allocate a suitably-sized BIO.  For now,
  * just allocate full-size (16-page) BIOs.
  */
-static struct bio *
-__mpage_writepage(struct bio *bio, struct page *page, get_block_t get_block,
-	sector_t *last_block_in_bio, int *ret, struct writeback_control *wbc,
-	writepage_t writepage_fn)
+struct mpage_data {
+	struct bio *bio;
+	sector_t last_block_in_bio;
+	get_block_t *get_block;
+	unsigned use_writepage;
+};
+
+static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
+			     void *data)
 {
+	struct mpage_data *mpd = data;
+	struct bio *bio = mpd->bio;
 	struct address_space *mapping = page->mapping;
 	struct inode *inode = page->mapping->host;
 	const unsigned blkbits = inode->i_blkbits;
@@ -478,6 +485,7 @@ __mpage_writepage(struct bio *bio, struc
 	int length;
 	struct buffer_head map_bh;
 	loff_t i_size = i_size_read(inode);
+	int ret = 0;
 
 	if (page_has_buffers(page)) {
 		struct buffer_head *head = page_buffers(page);
@@ -540,7 +548,7 @@ __mpage_writepage(struct bio *bio, struc
 
 		map_bh.b_state = 0;
 		map_bh.b_size = 1 << blkbits;
-		if (get_block(inode, block_in_file, &map_bh, 1))
+		if (mpd->get_block(inode, block_in_file, &map_bh, 1))
 			goto confused;
 		if (buffer_new(&map_bh))
 			unmap_underlying_metadata(map_bh.b_bdev,
@@ -589,7 +597,7 @@ page_is_mapped:
 	/*
 	 * This page will go to BIO.  Do we need to send this BIO off first?
 	 */
-	if (bio && *last_block_in_bio != blocks[0] - 1)
+	if (bio && mpd->last_block_in_bio != blocks[0] - 1)
 		bio = mpage_bio_submit(WRITE, bio);
 
 alloc_new:
@@ -646,7 +654,7 @@ alloc_new:
 					boundary_block, 1 << blkbits);
 		}
 	} else {
-		*last_block_in_bio = blocks[blocks_per_page - 1];
+		mpd->last_block_in_bio = blocks[blocks_per_page - 1];
 	}
 	goto out;
 
@@ -654,18 +662,19 @@ confused:
 	if (bio)
 		bio = mpage_bio_submit(WRITE, bio);
 
-	if (writepage_fn) {
-		*ret = (*writepage_fn)(page, wbc);
+	if (mpd->use_writepage && mapping->a_ops->writepage) {
+		ret = mapping->a_ops->writepage(page, wbc);
 	} else {
-		*ret = -EAGAIN;
+		ret = -EAGAIN;
 		goto out;
 	}
 	/*
 	 * The caller has a ref on the inode, so *mapping is stable
 	 */
-	mapping_set_error(mapping, *ret);
+	mapping_set_error(mapping, ret);
 out:
-	return bio;
+	mpd->bio = bio;
+	return ret;
 }
 
 /**
@@ -688,120 +697,27 @@ out:
  * the call was made get new I/O started against them.  If wbc->sync_mode is
  * WB_SYNC_ALL then we were called for data integrity and we must wait for
  * existing IO to complete.
- *
- * If you fix this you should check generic_writepages() also!
  */
 int
 mpage_writepages(struct address_space *mapping,
 		struct writeback_control *wbc, get_block_t get_block)
 {
-	struct backing_dev_info *bdi = mapping->backing_dev_info;
-	struct bio *bio = NULL;
-	sector_t last_block_in_bio = 0;
-	int ret = 0;
-	int done = 0;
-	int (*writepage)(struct page *page, struct writeback_control *wbc);
-	struct pagevec pvec;
-	int nr_pages;
-	pgoff_t index;
-	pgoff_t end;		/* Inclusive */
-	int scanned = 0;
-	int range_whole = 0;
-
-	if (wbc->nonblocking && bdi_write_congested(bdi)) {
-		wbc->encountered_congestion = 1;
-		return 0;
-	}
+	int ret;
 
-	writepage = NULL;
-	if (get_block == NULL)
-		writepage = mapping->a_ops->writepage;
-
-	pagevec_init(&pvec, 0);
-	if (wbc->range_cyclic) {
-		index = mapping->writeback_index; /* Start from prev offset */
-		end = -1;
-	} else {
-		index = wbc->range_start >> PAGE_CACHE_SHIFT;
-		end = wbc->range_end >> PAGE_CACHE_SHIFT;
-		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
-			range_whole = 1;
-		scanned = 1;
+	if (!get_block)
+		ret = generic_writepages(mapping, wbc);
+	else {
+		struct mpage_data mpd = {
+			.bio = NULL,
+			.last_block_in_bio = 0,
+			.get_block = get_block,
+			.use_writepage = 1,
+		};
+
+		ret = write_cache_pages(mapping, wbc, __mpage_writepage, &mpd);
+		if (mpd.bio)
+			mpage_bio_submit(WRITE, mpd.bio);
 	}
-retry:
-	while (!done && (index <= end) &&
-			(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
-			PAGECACHE_TAG_DIRTY,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
-		unsigned i;
-
-		scanned = 1;
-		for (i = 0; i < nr_pages; i++) {
-			struct page *page = pvec.pages[i];
-
-			/*
-			 * At this point we hold neither mapping->tree_lock nor
-			 * lock on the page itself: the page may be truncated or
-			 * invalidated (changing page->mapping to NULL), or even
-			 * swizzled back from swapper_space to tmpfs file
-			 * mapping
-			 */
-
-			lock_page(page);
-
-			if (unlikely(page->mapping != mapping)) {
-				unlock_page(page);
-				continue;
-			}
-
-			if (!wbc->range_cyclic && page->index > end) {
-				done = 1;
-				unlock_page(page);
-				continue;
-			}
-
-			if (wbc->sync_mode != WB_SYNC_NONE)
-				wait_on_page_writeback(page);
-
-			if (PageWriteback(page) ||
-					!clear_page_dirty_for_io(page)) {
-				unlock_page(page);
-				continue;
-			}
-
-			if (writepage) {
-				ret = (*writepage)(page, wbc);
-				mapping_set_error(mapping, ret);
-			} else {
-				bio = __mpage_writepage(bio, page, get_block,
-						&last_block_in_bio, &ret, wbc,
-						page->mapping->a_ops->writepage);
-			}
-			if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE))
-				unlock_page(page);
-			if (ret || (--(wbc->nr_to_write) <= 0))
-				done = 1;
-			if (wbc->nonblocking && bdi_write_congested(bdi)) {
-				wbc->encountered_congestion = 1;
-				done = 1;
-			}
-		}
-		pagevec_release(&pvec);
-		cond_resched();
-	}
-	if (!scanned && !done) {
-		/*
-		 * We hit the last page and there is more work to be done: wrap
-		 * back to the start of the file
-		 */
-		scanned = 1;
-		index = 0;
-		goto retry;
-	}
-	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
-		mapping->writeback_index = index;
-	if (bio)
-		mpage_bio_submit(WRITE, bio);
 	return ret;
 }
 EXPORT_SYMBOL(mpage_writepages);
@@ -809,15 +725,15 @@ EXPORT_SYMBOL(mpage_writepages);
 int mpage_writepage(struct page *page, get_block_t get_block,
 	struct writeback_control *wbc)
 {
-	int ret = 0;
-	struct bio *bio;
-	sector_t last_block_in_bio = 0;
-
-	bio = __mpage_writepage(NULL, page, get_block,
-			&last_block_in_bio, &ret, wbc, NULL);
-	if (bio)
-		mpage_bio_submit(WRITE, bio);
-
+	struct mpage_data mpd = {
+		.bio = NULL,
+		.last_block_in_bio = 0,
+		.get_block = get_block,
+		.use_writepage = 0,
+	};
+	int ret = __mpage_writepage(page, wbc, &mpd);
+	if (mpd.bio)
+		mpage_bio_submit(WRITE, mpd.bio);
 	return ret;
 }
 EXPORT_SYMBOL(mpage_writepage);
Index: linux/include/linux/mpage.h
===================================================================
--- linux.orig/include/linux/mpage.h	2007-02-27 14:40:55.000000000 +0100
+++ linux/include/linux/mpage.h	2007-02-27 14:41:08.000000000 +0100
@@ -12,7 +12,6 @@
 #ifdef CONFIG_BLOCK
 
 struct writeback_control;
-typedef int (writepage_t)(struct page *page, struct writeback_control *wbc);
 
 int mpage_readpages(struct address_space *mapping, struct list_head *pages,
 				unsigned nr_pages, get_block_t get_block);
Index: linux/include/linux/writeback.h
===================================================================
--- linux.orig/include/linux/writeback.h	2007-02-27 14:41:07.000000000 +0100
+++ linux/include/linux/writeback.h	2007-02-27 14:41:08.000000000 +0100
@@ -109,9 +109,15 @@ balance_dirty_pages_ratelimited(struct a
 	balance_dirty_pages_ratelimited_nr(mapping, 1);
 }
 
+typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
+				void *data);
+
 int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0);
-extern int generic_writepages(struct address_space *mapping,
-			      struct writeback_control *wbc);
+int generic_writepages(struct address_space *mapping,
+		       struct writeback_control *wbc);
+int write_cache_pages(struct address_space *mapping,
+		      struct writeback_control *wbc, writepage_t writepage,
+		      void *data);
 int do_writepages(struct address_space *mapping, struct writeback_control *wbc);
 int sync_page_range(struct inode *inode, struct address_space *mapping,
 			loff_t pos, loff_t count);
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c	2007-02-27 14:41:08.000000000 +0100
+++ linux/mm/page-writeback.c	2007-02-27 14:41:08.000000000 +0100
@@ -567,31 +567,27 @@ void __init page_writeback_init(void)
 }
 
 /**
- * generic_writepages - walk the list of dirty pages of the given address space and writepage() all of them.
+ * write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
  * @mapping: address space structure to write
  * @wbc: subtract the number of written pages from *@wbc->nr_to_write
+ * @writepage: function called for each page
+ * @data: data passed to writepage function
  *
- * This is a library function, which implements the writepages()
- * address_space_operation.
- *
- * If a page is already under I/O, generic_writepages() skips it, even
+ * If a page is already under I/O, write_cache_pages() skips it, even
  * if it's dirty.  This is desirable behaviour for memory-cleaning writeback,
  * but it is INCORRECT for data-integrity system calls such as fsync().  fsync()
  * and msync() need to guarantee that all the data which was dirty at the time
  * the call was made get new I/O started against them.  If wbc->sync_mode is
  * WB_SYNC_ALL then we were called for data integrity and we must wait for
  * existing IO to complete.
- *
- * Derived from mpage_writepages() - if you fix this you should check that
- * also!
  */
-int generic_writepages(struct address_space *mapping,
-		       struct writeback_control *wbc)
+int write_cache_pages(struct address_space *mapping,
+		      struct writeback_control *wbc, writepage_t writepage,
+		      void *data)
 {
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	int ret = 0;
 	int done = 0;
-	int (*writepage)(struct page *page, struct writeback_control *wbc);
 	struct pagevec pvec;
 	int nr_pages;
 	pgoff_t index;
@@ -604,12 +600,6 @@ int generic_writepages(struct address_sp
 		return 0;
 	}
 
-	writepage = mapping->a_ops->writepage;
-
-	/* deal with chardevs and other special file */
-	if (!writepage)
-		return 0;
-
 	pagevec_init(&pvec, 0);
 	if (wbc->range_cyclic) {
 		index = mapping->writeback_index; /* Start from prev offset */
@@ -661,8 +651,7 @@ retry:
 				continue;
 			}
 
-			ret = (*writepage)(page, wbc);
-			mapping_set_error(mapping, ret);
+			ret = (*writepage)(page, wbc, data);
 
 			if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE))
 				unlock_page(page);
@@ -689,6 +678,38 @@ retry:
 		mapping->writeback_index = index;
 	return ret;
 }
+EXPORT_SYMBOL(write_cache_pages);
+
+/*
+ * Function used by generic_writepages to call the real writepage
+ * function and set the mapping flags on error
+ */
+static int __writepage(struct page *page, struct writeback_control *wbc,
+		       void *data)
+{
+	struct address_space *mapping = data;
+	int ret = mapping->a_ops->writepage(page, wbc);
+	mapping_set_error(mapping, ret);
+	return ret;
+}
+
+/**
+ * generic_writepages - walk the list of dirty pages of the given address space and writepage() all of them.
+ * @mapping: address space structure to write
+ * @wbc: subtract the number of written pages from *@wbc->nr_to_write
+ *
+ * This is a library function, which implements the writepages()
+ * address_space_operation.
+ */
+int generic_writepages(struct address_space *mapping,
+		       struct writeback_control *wbc)
+{
+	/* deal with chardevs and other special file */
+	if (!mapping->a_ops->writepage)
+		return 0;
+
+	return write_cache_pages(mapping, wbc, __writepage, mapping);
+}
 
 EXPORT_SYMBOL(generic_writepages);
 

--


* [patch 07/22] add filesystem subtype support
  2007-02-27 23:14 [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support Miklos Szeredi
                   ` (5 preceding siblings ...)
  2007-02-27 23:14 ` [patch 06/22] consolidate generic_writepages and mpage_writepages Miklos Szeredi
@ 2007-02-27 23:14 ` Miklos Szeredi
  2007-02-27 23:14 ` [patch 08/22] fuse: update backing_dev_info congestion state Miklos Szeredi
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-27 23:14 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: filesystem_subtypes.patch --]
[-- Type: text/plain, Size: 6838 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

There's a slight problem with how the filesystem type is represented
for fuse-based filesystems.

From the kernel's view, there are just two filesystem types: fuse and
fuseblk.  From the user's view there are lots of different filesystem
types.  The user is not even much concerned if the filesystem is fuse
based or not.  So there's a conflict of interest in how this should be
represented in fstab, mtab and /proc/mounts.

The current scheme is to encode the real filesystem type in the mount
source.  So an sshfs mount looks like this:

  sshfs#user@server:/   /mnt/server    fuse   rw,nosuid,nodev,...

This URL-ish syntax works OK for sshfs and similar filesystems.
However, for block-device-based filesystems (ntfs-3g, zfs) it doesn't
work, since the kernel expects the mount source to be a real device
name.

A possibly better scheme would be to encode the real type in the type
field as "type.subtype".  So fuse mounts would look like this:

  /dev/hda1       /mnt/windows   fuseblk.ntfs-3g   rw,...
  user@server:/   /mnt/server    fuse.sshfs        rw,nosuid,nodev,...

This patch adds the necessary code to the kernel so that this can be
correctly displayed in /proc/mounts.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/fs/filesystems.c
===================================================================
--- linux.orig/fs/filesystems.c	2007-02-27 14:40:55.000000000 +0100
+++ linux/fs/filesystems.c	2007-02-27 14:41:09.000000000 +0100
@@ -41,11 +41,12 @@ void put_filesystem(struct file_system_t
 	module_put(fs->owner);
 }
 
-static struct file_system_type **find_filesystem(const char *name)
+static struct file_system_type **find_filesystem(const char *name, unsigned len)
 {
 	struct file_system_type **p;
 	for (p=&file_systems; *p; p=&(*p)->next)
-		if (strcmp((*p)->name,name) == 0)
+		if (strlen((*p)->name) == len &&
+		    strncmp((*p)->name, name, len) == 0)
 			break;
 	return p;
 }
@@ -68,11 +69,12 @@ int register_filesystem(struct file_syst
 	int res = 0;
 	struct file_system_type ** p;
 
+	BUG_ON(strchr(fs->name, '.'));
 	if (fs->next)
 		return -EBUSY;
 	INIT_LIST_HEAD(&fs->fs_supers);
 	write_lock(&file_systems_lock);
-	p = find_filesystem(fs->name);
+	p = find_filesystem(fs->name, strlen(fs->name));
 	if (*p)
 		res = -EBUSY;
 	else
@@ -215,19 +217,26 @@ int get_filesystem_list(char * buf)
 struct file_system_type *get_fs_type(const char *name)
 {
 	struct file_system_type *fs;
+	const char *dot = strchr(name, '.');
+	unsigned len = dot ? dot - name : strlen(name);
 
 	read_lock(&file_systems_lock);
-	fs = *(find_filesystem(name));
+	fs = *(find_filesystem(name, len));
 	if (fs && !try_module_get(fs->owner))
 		fs = NULL;
 	read_unlock(&file_systems_lock);
-	if (!fs && (request_module("%s", name) == 0)) {
+	if (!fs && (request_module("%.*s", len, name) == 0)) {
 		read_lock(&file_systems_lock);
-		fs = *(find_filesystem(name));
+		fs = *(find_filesystem(name, len));
 		if (fs && !try_module_get(fs->owner))
 			fs = NULL;
 		read_unlock(&file_systems_lock);
 	}
+
+	if (dot && fs && !(fs->fs_flags & FS_HAS_SUBTYPE)) {
+		put_filesystem(fs);
+		fs = NULL;
+	}
 	return fs;
 }
 
Index: linux/fs/fuse/inode.c
===================================================================
--- linux.orig/fs/fuse/inode.c	2007-02-27 14:40:55.000000000 +0100
+++ linux/fs/fuse/inode.c	2007-02-27 14:41:09.000000000 +0100
@@ -634,6 +634,7 @@ static int fuse_get_sb(struct file_syste
 static struct file_system_type fuse_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "fuse",
+	.fs_flags	= FS_HAS_SUBTYPE,
 	.get_sb		= fuse_get_sb,
 	.kill_sb	= kill_anon_super,
 };
@@ -650,6 +651,6 @@ static int fuse_get_sb_blk(struct file_s
 static struct file_system_type fuseblk_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "fuseblk",
 	.get_sb		= fuse_get_sb_blk,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_HAS_SUBTYPE,
Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c	2007-02-27 14:40:55.000000000 +0100
+++ linux/fs/namespace.c	2007-02-27 14:41:09.000000000 +0100
@@ -377,6 +377,10 @@ static int show_vfsmnt(struct seq_file *
 	seq_path(m, mnt, mnt->mnt_root, " \t\n\\");
 	seq_putc(m, ' ');
 	mangle(m, mnt->mnt_sb->s_type->name);
+	if (mnt->mnt_sb->s_subtype && mnt->mnt_sb->s_subtype[0]) {
+		seq_putc(m, '.');
+		mangle(m, mnt->mnt_sb->s_subtype);
+	}
 	seq_puts(m, mnt->mnt_sb->s_flags & MS_RDONLY ? " ro" : " rw");
 	for (fs_infop = fs_info; fs_infop->flag; fs_infop++) {
 		if (mnt->mnt_sb->s_flags & fs_infop->flag)
Index: linux/fs/super.c
===================================================================
--- linux.orig/fs/super.c	2007-02-27 14:40:55.000000000 +0100
+++ linux/fs/super.c	2007-02-27 14:41:09.000000000 +0100
@@ -107,6 +107,7 @@ out:
 static inline void destroy_super(struct super_block *s)
 {
 	security_sb_free(s);
+	kfree(s->s_subtype);
 	kfree(s);
 }
 
@@ -919,6 +920,29 @@ out:
 
 EXPORT_SYMBOL_GPL(vfs_kern_mount);
 
+static struct vfsmount *fs_set_subtype(struct vfsmount *mnt, const char *fstype)
+{
+	int err;
+	const char *subtype = strchr(fstype, '.');
+	if (subtype) {
+		subtype++;
+		err = -EINVAL;
+		if (!subtype[0])
+			goto err;
+	} else
+		subtype = "";
+
+	mnt->mnt_sb->s_subtype = kstrdup(subtype, GFP_KERNEL);
+	err = -ENOMEM;
+	if (!mnt->mnt_sb->s_subtype)
+		goto err;
+	return mnt;
+
+ err:
+	mntput(mnt);
+	return ERR_PTR(err);
+}
+
 struct vfsmount *
 do_kern_mount(const char *fstype, int flags, const char *name, void *data)
 {
@@ -927,6 +951,9 @@ do_kern_mount(const char *fstype, int fl
 	if (!type)
 		return ERR_PTR(-ENODEV);
 	mnt = vfs_kern_mount(type, flags, name, data);
+	if (!IS_ERR(mnt) && (type->fs_flags & FS_HAS_SUBTYPE) &&
+	    !mnt->mnt_sb->s_subtype)
+		mnt = fs_set_subtype(mnt, fstype);
 	put_filesystem(type);
 	return mnt;
 }
Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h	2007-02-27 14:41:07.000000000 +0100
+++ linux/include/linux/fs.h	2007-02-27 14:41:09.000000000 +0100
@@ -91,6 +91,7 @@ extern int dir_notify_enable;
 /* public flags for file_system_type */
 #define FS_REQUIRES_DEV 1 
 #define FS_BINARY_MOUNTDATA 2
+#define FS_HAS_SUBTYPE 4
 #define FS_REVAL_DOT	16384	/* Check the paths ".", ".." for staleness */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move()
 					 * during rename() internally.
@@ -951,6 +952,12 @@ struct super_block {
 	/* Granularity of c/m/atime in ns.
 	   Cannot be worse than a second */
 	u32		   s_time_gran;
+
+	/*
+	 * Filesystem subtype.  If non-empty the filesystem type field
+	 * in /proc/mounts will be "type.subtype"
+	 */
+	char *s_subtype;
 };
 
 extern struct timespec current_fs_time(struct super_block *sb);

--


* [patch 08/22] fuse: update backing_dev_info congestion state
  2007-02-27 23:14 [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support Miklos Szeredi
                   ` (6 preceding siblings ...)
  2007-02-27 23:14 ` [patch 07/22] add filesystem subtype support Miklos Szeredi
@ 2007-02-27 23:14 ` Miklos Szeredi
  2007-02-27 23:14 ` [patch 09/22] fuse: fix reserved request wake up Miklos Szeredi
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-27 23:14 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: fuse_congestion.patch --]
[-- Type: text/plain, Size: 1800 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Set the read and write congestion state if the request queue is close
to blocking, and clear it when it's not.

This prevents unnecessary blocking in readahead and writeback.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/fs/fuse/dev.c
===================================================================
--- linux.orig/fs/fuse/dev.c	2007-02-27 14:40:55.000000000 +0100
+++ linux/fs/fuse/dev.c	2007-02-27 14:41:09.000000000 +0100
@@ -224,6 +224,10 @@ static void request_end(struct fuse_conn
 			fc->blocked = 0;
 			wake_up_all(&fc->blocked_waitq);
 		}
+		if (fc->num_background == FUSE_CONGESTION_THRESHOLD) {
+			clear_bdi_congested(&fc->bdi, READ);
+			clear_bdi_congested(&fc->bdi, WRITE);
+		}
 		fc->num_background--;
 	}
 	spin_unlock(&fc->lock);
@@ -378,6 +382,10 @@ static void request_send_nowait(struct f
 		fc->num_background++;
 		if (fc->num_background == FUSE_MAX_BACKGROUND)
 			fc->blocked = 1;
+		if (fc->num_background == FUSE_CONGESTION_THRESHOLD) {
+			set_bdi_congested(&fc->bdi, READ);
+			set_bdi_congested(&fc->bdi, WRITE);
+		}
 
 		queue_request(fc, req);
 		spin_unlock(&fc->lock);
Index: linux/fs/fuse/fuse_i.h
===================================================================
--- linux.orig/fs/fuse/fuse_i.h	2007-02-27 14:40:55.000000000 +0100
+++ linux/fs/fuse/fuse_i.h	2007-02-27 14:41:09.000000000 +0100
@@ -20,7 +20,10 @@
 #define FUSE_MAX_PAGES_PER_REQ 32
 
 /** Maximum number of outstanding background requests */
-#define FUSE_MAX_BACKGROUND 10
+#define FUSE_MAX_BACKGROUND 12
+
+/** Congestion starts at 75% of maximum */
+#define FUSE_CONGESTION_THRESHOLD (FUSE_MAX_BACKGROUND * 75 / 100)
 
 /** It could be as large as PATH_MAX, but would that have any uses? */
 #define FUSE_NAME_MAX 1024

--


* [patch 09/22] fuse: fix reserved request wake up
  2007-02-27 23:14 [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support Miklos Szeredi
                   ` (7 preceding siblings ...)
  2007-02-27 23:14 ` [patch 08/22] fuse: update backing_dev_info congestion state Miklos Szeredi
@ 2007-02-27 23:14 ` Miklos Szeredi
  2007-02-27 23:14 ` [patch 10/22] fuse: add reference counting to fuse_file Miklos Szeredi
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-27 23:14 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: fuse_wake_up_fix.patch --]
[-- Type: text/plain, Size: 2425 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Use wake_up_all instead of wake_up in put_reserved_req(), otherwise it
is possible that the right task is not woken up.

Also create a separate reserved_req_waitq in addition to the
blocked_waitq, since they fulfill totally separate functions.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/fs/fuse/dev.c
===================================================================
--- linux.orig/fs/fuse/dev.c	2007-02-27 14:41:09.000000000 +0100
+++ linux/fs/fuse/dev.c	2007-02-27 14:41:10.000000000 +0100
@@ -129,7 +129,7 @@ static struct fuse_req *get_reserved_req
 	struct fuse_file *ff = file->private_data;
 
 	do {
-		wait_event(fc->blocked_waitq, ff->reserved_req);
+		wait_event(fc->reserved_req_waitq, ff->reserved_req);
 		spin_lock(&fc->lock);
 		if (ff->reserved_req) {
 			req = ff->reserved_req;
@@ -155,7 +155,7 @@ static void put_reserved_req(struct fuse
 	fuse_request_init(req);
 	BUG_ON(ff->reserved_req);
 	ff->reserved_req = req;
-	wake_up(&fc->blocked_waitq);
+	wake_up_all(&fc->reserved_req_waitq);
 	spin_unlock(&fc->lock);
 	fput(file);
 }
Index: linux/fs/fuse/fuse_i.h
===================================================================
--- linux.orig/fs/fuse/fuse_i.h	2007-02-27 14:41:09.000000000 +0100
+++ linux/fs/fuse/fuse_i.h	2007-02-27 14:41:10.000000000 +0100
@@ -289,6 +289,9 @@ struct fuse_conn {
 	/** waitq for blocked connection */
 	wait_queue_head_t blocked_waitq;
 
+	/** waitq for reserved requests */
+	wait_queue_head_t reserved_req_waitq;
+
 	/** The next unique request id */
 	u64 reqctr;
 
Index: linux/fs/fuse/inode.c
===================================================================
--- linux.orig/fs/fuse/inode.c	2007-02-27 14:41:09.000000000 +0100
+++ linux/fs/fuse/inode.c	2007-02-27 14:41:10.000000000 +0100
@@ -231,6 +231,7 @@ static void fuse_put_super(struct super_
 	kill_fasync(&fc->fasync, SIGIO, POLL_IN);
 	wake_up_all(&fc->waitq);
 	wake_up_all(&fc->blocked_waitq);
+	wake_up_all(&fc->reserved_req_waitq);
 	mutex_lock(&fuse_mutex);
 	list_del(&fc->entry);
 	fuse_ctl_remove_conn(fc);
@@ -406,6 +407,7 @@ static struct fuse_conn *new_conn(void)
 		atomic_set(&fc->count, 1);
 		init_waitqueue_head(&fc->waitq);
 		init_waitqueue_head(&fc->blocked_waitq);
+		init_waitqueue_head(&fc->reserved_req_waitq);
 		INIT_LIST_HEAD(&fc->pending);
 		INIT_LIST_HEAD(&fc->processing);
 		INIT_LIST_HEAD(&fc->io);

--


* [patch 10/22] fuse: add reference counting to fuse_file
  2007-02-27 23:14 [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support Miklos Szeredi
                   ` (8 preceding siblings ...)
  2007-02-27 23:14 ` [patch 09/22] fuse: fix reserved request wake up Miklos Szeredi
@ 2007-02-27 23:14 ` Miklos Szeredi
  2007-02-27 23:14 ` [patch 11/22] fuse: add truncation semaphore Miklos Szeredi
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-27 23:14 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: fuse_refcounted_fuse_file.patch --]
[-- Type: text/plain, Size: 9823 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Make lifetime of 'struct fuse_file' independent from 'struct file' by
adding a reference counter and destructor.

This will enable asynchronous page writeback, where it cannot be
guaranteed that the file is not released while a request referring to
this file handle is still being served.

The actual RELEASE request is only sent when there are no more
references to the fuse_file.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/fs/fuse/file.c
===================================================================
--- linux.orig/fs/fuse/file.c	2007-02-27 14:40:55.000000000 +0100
+++ linux/fs/fuse/file.c	2007-02-27 14:41:10.000000000 +0100
@@ -53,6 +53,7 @@ struct fuse_file *fuse_file_alloc(void)
 			kfree(ff);
 			ff = NULL;
 		}
+		atomic_set(&ff->count, 0);
 	}
 	return ff;
 }
@@ -63,6 +64,22 @@ void fuse_file_free(struct fuse_file *ff
 	kfree(ff);
 }
 
+static struct fuse_file *fuse_file_get(struct fuse_file *ff)
+{
+	atomic_inc(&ff->count);
+	return ff;
+}
+
+static void fuse_file_put(struct fuse_file *ff)
+{
+	if (atomic_dec_and_test(&ff->count)) {
+		struct fuse_req *req = ff->reserved_req;
+		struct fuse_conn *fc = get_fuse_conn(req->dentry->d_inode);
+		request_send_background(fc, req);
+		kfree(ff);
+	}
+}
+
 void fuse_finish_open(struct inode *inode, struct file *file,
 		      struct fuse_file *ff, struct fuse_open_out *outarg)
 {
@@ -71,7 +88,7 @@ void fuse_finish_open(struct inode *inod
 	if (!(outarg->open_flags & FOPEN_KEEP_CACHE))
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 	ff->fh = outarg->fh;
-	file->private_data = ff;
+	file->private_data = fuse_file_get(ff);
 }
 
 int fuse_open_common(struct inode *inode, struct file *file, int isdir)
@@ -112,8 +129,7 @@ int fuse_open_common(struct inode *inode
 	return err;
 }
 
-struct fuse_req *fuse_release_fill(struct fuse_file *ff, u64 nodeid, int flags,
-				   int opcode)
+void fuse_release_fill(struct fuse_file *ff, u64 nodeid, int flags, int opcode)
 {
 	struct fuse_req *req = ff->reserved_req;
 	struct fuse_release_in *inarg = &req->misc.release_in;
@@ -125,25 +141,24 @@ struct fuse_req *fuse_release_fill(struc
 	req->in.numargs = 1;
 	req->in.args[0].size = sizeof(struct fuse_release_in);
 	req->in.args[0].value = inarg;
-	kfree(ff);
-
-	return req;
 }
 
 int fuse_release_common(struct inode *inode, struct file *file, int isdir)
 {
 	struct fuse_file *ff = file->private_data;
 	if (ff) {
-		struct fuse_conn *fc = get_fuse_conn(inode);
-		struct fuse_req *req;
-
-		req = fuse_release_fill(ff, get_node_id(inode), file->f_flags,
-					isdir ? FUSE_RELEASEDIR : FUSE_RELEASE);
+		fuse_release_fill(ff, get_node_id(inode), file->f_flags,
+				  isdir ? FUSE_RELEASEDIR : FUSE_RELEASE);
 
 		/* Hold vfsmount and dentry until release is finished */
-		req->vfsmount = mntget(file->f_path.mnt);
-		req->dentry = dget(file->f_path.dentry);
-		request_send_background(fc, req);
+		ff->reserved_req->vfsmount = mntget(file->f_path.mnt);
+		ff->reserved_req->dentry = dget(file->f_path.dentry);
+		/*
+		 * Normally this will send the RELEASE request,
+		 * however if some asynchronous READ or WRITE requests
+		 * are outstanding, the sending will be delayed
+		 */
+		fuse_file_put(ff);
 	}
 
 	/* Return value is ignored by VFS */
@@ -263,10 +278,9 @@ static int fuse_fsync(struct file *file,
 	return fuse_fsync_common(file, de, datasync, 0);
 }
 
-void fuse_read_fill(struct fuse_req *req, struct file *file,
+void fuse_read_fill(struct fuse_req *req, struct fuse_file *ff,
 		    struct inode *inode, loff_t pos, size_t count, int opcode)
 {
-	struct fuse_file *ff = file->private_data;
 	struct fuse_read_in *inarg = &req->misc.read_in;
 
 	inarg->fh = ff->fh;
@@ -287,7 +301,8 @@ static size_t fuse_send_read(struct fuse
 			     struct inode *inode, loff_t pos, size_t count)
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
-	fuse_read_fill(req, file, inode, pos, count, FUSE_READ);
+	struct fuse_file *ff = file->private_data;
+	fuse_read_fill(req, ff, inode, pos, count, FUSE_READ);
 	request_send(fc, req);
 	return req->out.args[0].size;
 }
@@ -336,20 +351,21 @@ static void fuse_readpages_end(struct fu
 			SetPageError(page);
 		unlock_page(page);
 	}
+	if (req->ff)
+		fuse_file_put(req->ff);
 	fuse_put_request(fc, req);
 }
 
-static void fuse_send_readpages(struct fuse_req *req, struct file *file,
+static void fuse_send_readpages(struct fuse_req *req, struct fuse_file *ff,
 				struct inode *inode)
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	loff_t pos = page_offset(req->pages[0]);
 	size_t count = req->num_pages << PAGE_CACHE_SHIFT;
 	req->out.page_zeroing = 1;
-	fuse_read_fill(req, file, inode, pos, count, FUSE_READ);
+	fuse_read_fill(req, ff, inode, pos, count, FUSE_READ);
 	if (fc->async_read) {
-		get_file(file);
-		req->file = file;
+		req->ff = fuse_file_get(ff);
 		req->end = fuse_readpages_end;
 		request_send_background(fc, req);
 	} else {
@@ -358,15 +374,15 @@ static void fuse_send_readpages(struct f
 	}
 }
 
-struct fuse_readpages_data {
+struct fuse_fill_data {
 	struct fuse_req *req;
-	struct file *file;
+	struct fuse_file *ff;
 	struct inode *inode;
 };
 
 static int fuse_readpages_fill(void *_data, struct page *page)
 {
-	struct fuse_readpages_data *data = _data;
+	struct fuse_fill_data *data = _data;
 	struct fuse_req *req = data->req;
 	struct inode *inode = data->inode;
 	struct fuse_conn *fc = get_fuse_conn(inode);
@@ -375,7 +391,7 @@ static int fuse_readpages_fill(void *_da
 	    (req->num_pages == FUSE_MAX_PAGES_PER_REQ ||
 	     (req->num_pages + 1) * PAGE_CACHE_SIZE > fc->max_read ||
 	     req->pages[req->num_pages - 1]->index + 1 != page->index)) {
-		fuse_send_readpages(req, data->file, inode);
+		fuse_send_readpages(req, data->ff, inode);
 		data->req = req = fuse_get_req(fc);
 		if (IS_ERR(req)) {
 			unlock_page(page);
@@ -392,14 +408,14 @@ static int fuse_readpages(struct file *f
 {
 	struct inode *inode = mapping->host;
 	struct fuse_conn *fc = get_fuse_conn(inode);
-	struct fuse_readpages_data data;
+	struct fuse_fill_data data;
 	int err;
 
 	err = -EIO;
 	if (is_bad_inode(inode))
 		goto out;
 
-	data.file = file;
+	data.ff = file->private_data;
 	data.inode = inode;
 	data.req = fuse_get_req(fc);
 	err = PTR_ERR(data.req);
@@ -409,7 +425,7 @@ static int fuse_readpages(struct file *f
 	err = read_cache_pages(mapping, pages, fuse_readpages_fill, &data);
 	if (!err) {
 		if (data.req->num_pages)
-			fuse_send_readpages(data.req, file, inode);
+			fuse_send_readpages(data.req, data.ff, inode);
 		else
 			fuse_put_request(fc, data.req);
 	}
Index: linux/fs/fuse/fuse_i.h
===================================================================
--- linux.orig/fs/fuse/fuse_i.h	2007-02-27 14:41:10.000000000 +0100
+++ linux/fs/fuse/fuse_i.h	2007-02-27 14:41:10.000000000 +0100
@@ -72,6 +72,9 @@ struct fuse_file {
 
 	/** File handle used by userspace */
 	u64 fh;
+
+	/** Refcount */
+	atomic_t count;
 };
 
 /** One input argument of a request */
@@ -216,7 +219,7 @@ struct fuse_req {
 	unsigned page_offset;
 
 	/** File used in the request (or NULL) */
-	struct file *file;
+	struct fuse_file *ff;
 
 	/** vfsmount used in release */
 	struct vfsmount *vfsmount;
@@ -420,7 +423,7 @@ void fuse_send_forget(struct fuse_conn *
 /**
  * Initialize READ or READDIR request
  */
-void fuse_read_fill(struct fuse_req *req, struct file *file,
+void fuse_read_fill(struct fuse_req *req, struct fuse_file *ff,
 		    struct inode *inode, loff_t pos, size_t count, int opcode);
 
 /**
@@ -433,9 +436,9 @@ void fuse_file_free(struct fuse_file *ff
 void fuse_finish_open(struct inode *inode, struct file *file,
 		      struct fuse_file *ff, struct fuse_open_out *outarg);
 
-/** */
-struct fuse_req *fuse_release_fill(struct fuse_file *ff, u64 nodeid, int flags,
-				   int opcode);
+/** Fill in ff->reserved_req with a RELEASE request */
+void fuse_release_fill(struct fuse_file *ff, u64 nodeid, int flags, int opcode);
+
 /**
  * Send RELEASE or RELEASEDIR request
  */
Index: linux/fs/fuse/dir.c
===================================================================
--- linux.orig/fs/fuse/dir.c	2007-02-27 14:40:55.000000000 +0100
+++ linux/fs/fuse/dir.c	2007-02-27 14:41:10.000000000 +0100
@@ -287,12 +287,11 @@ static struct dentry *fuse_lookup(struct
 static void fuse_sync_release(struct fuse_conn *fc, struct fuse_file *ff,
 			      u64 nodeid, int flags)
 {
-	struct fuse_req *req;
-
-	req = fuse_release_fill(ff, nodeid, flags, FUSE_RELEASE);
-	req->force = 1;
-	request_send(fc, req);
-	fuse_put_request(fc, req);
+	fuse_release_fill(ff, nodeid, flags, FUSE_RELEASE);
+	ff->reserved_req->force = 1;
+	request_send(fc, ff->reserved_req);
+	fuse_put_request(fc, ff->reserved_req);
+	kfree(ff);
 }
 
 /*
@@ -858,6 +857,7 @@ static int fuse_readdir(struct file *fil
 	struct page *page;
 	struct inode *inode = file->f_path.dentry->d_inode;
 	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_file *ff = file->private_data;
 	struct fuse_req *req;
 
 	if (is_bad_inode(inode))
@@ -874,7 +874,7 @@ static int fuse_readdir(struct file *fil
 	}
 	req->num_pages = 1;
 	req->pages[0] = page;
-	fuse_read_fill(req, file, inode, file->f_pos, PAGE_SIZE, FUSE_READDIR);
+	fuse_read_fill(req, ff, inode, file->f_pos, PAGE_SIZE, FUSE_READDIR);
 	request_send(fc, req);
 	nbytes = req->out.args[0].size;
 	err = req->out.h.error;
Index: linux/fs/fuse/dev.c
===================================================================
--- linux.orig/fs/fuse/dev.c	2007-02-27 14:41:10.000000000 +0100
+++ linux/fs/fuse/dev.c	2007-02-27 14:41:10.000000000 +0100
@@ -233,8 +233,6 @@ static void request_end(struct fuse_conn
 	spin_unlock(&fc->lock);
 	dput(req->dentry);
 	mntput(req->vfsmount);
-	if (req->file)
-		fput(req->file);
 	wake_up(&req->waitq);
 	if (end)
 		end(fc, req);

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [patch 11/22] fuse: add truncation semaphore
  2007-02-27 23:14 [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support Miklos Szeredi
                   ` (9 preceding siblings ...)
  2007-02-27 23:14 ` [patch 10/22] fuse: add reference counting to fuse_file Miklos Szeredi
@ 2007-02-27 23:14 ` Miklos Szeredi
  2007-02-27 23:14 ` [patch 12/22] fuse: fix page invalidation Miklos Szeredi
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-27 23:14 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: fuse_truncate_sem.patch --]
[-- Type: text/plain, Size: 4073 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Add a new semaphore to prevent asynchronous page writeback during the
TRUNCATE request.

Using i_alloc_sem would almost work, but it has to be released before
invalidating the truncated pages, so it's easier to define a separate
one.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/fs/fuse/dir.c
===================================================================
--- linux.orig/fs/fuse/dir.c	2007-02-27 14:41:10.000000000 +0100
+++ linux/fs/fuse/dir.c	2007-02-27 14:41:10.000000000 +0100
@@ -981,19 +981,9 @@ static void iattr_to_fattr(struct iattr 
 
 static void fuse_vmtruncate(struct inode *inode, loff_t offset)
 {
-	struct fuse_conn *fc = get_fuse_conn(inode);
-	int need_trunc;
-
-	spin_lock(&fc->lock);
-	need_trunc = inode->i_size > offset;
-	i_size_write(inode, offset);
-	spin_unlock(&fc->lock);
-
-	if (need_trunc) {
-		struct address_space *mapping = inode->i_mapping;
-		unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
-		truncate_inode_pages(mapping, offset);
-	}
+	struct address_space *mapping = inode->i_mapping;
+	unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
+	truncate_inode_pages(mapping, offset);
 }
 
 /*
@@ -1037,6 +1027,9 @@ static int fuse_setattr(struct dentry *e
 	if (IS_ERR(req))
 		return PTR_ERR(req);
 
+	if (is_truncate)
+		down_write(&fi->truncate_sem);
+
 	memset(&inarg, 0, sizeof(inarg));
 	iattr_to_fattr(attr, &inarg);
 	req->in.h.opcode = FUSE_SETATTR;
@@ -1050,20 +1043,41 @@ static int fuse_setattr(struct dentry *e
 	request_send(fc, req);
 	err = req->out.h.error;
 	fuse_put_request(fc, req);
-	if (!err) {
-		if ((inode->i_mode ^ outarg.attr.mode) & S_IFMT) {
-			make_bad_inode(inode);
-			err = -EIO;
-		} else {
-			if (is_truncate)
-				fuse_vmtruncate(inode, outarg.attr.size);
-			fuse_change_attributes(inode, &outarg.attr);
-			fi->i_time = time_to_jiffies(outarg.attr_valid,
-						     outarg.attr_valid_nsec);
-		}
-	} else if (err == -EINTR)
+	if (err == -EINTR)
 		fuse_invalidate_attr(inode);
+	if (err)
+		goto out_err;
 
+	err = -EIO;
+	if ((inode->i_mode ^ outarg.attr.mode) & S_IFMT) {
+		make_bad_inode(inode);
+		goto out_err;
+	}
+
+	if (is_truncate) {
+		int need_trunc;
+
+		spin_lock(&fc->lock);
+		need_trunc = inode->i_size > outarg.attr.size;
+		i_size_write(inode, outarg.attr.size);
+		spin_unlock(&fc->lock);
+
+		/*
+		 * Lock ordering is: page-lock -> truncate_sem, because
+		 * writepage is called with a locked page.  So release
+		 * truncate_sem before truncating the mapping
+		 */
+		up_write(&fi->truncate_sem);
+		if (need_trunc)
+			fuse_vmtruncate(inode, outarg.attr.size);
+	}
+	fuse_change_attributes(inode, &outarg.attr);
+	fi->i_time = time_to_jiffies(outarg.attr_valid, outarg.attr_valid_nsec);
+	return 0;
+
+ out_err:
+	if (is_truncate)
+		up_write(&fi->truncate_sem);
 	return err;
 }
 
Index: linux/fs/fuse/fuse_i.h
===================================================================
--- linux.orig/fs/fuse/fuse_i.h	2007-02-27 14:41:10.000000000 +0100
+++ linux/fs/fuse/fuse_i.h	2007-02-27 14:41:10.000000000 +0100
@@ -15,6 +15,7 @@
 #include <linux/mm.h>
 #include <linux/backing-dev.h>
 #include <linux/mutex.h>
+#include <linux/rwsem.h>
 
 /** Max number of pages that can be used in a single read request */
 #define FUSE_MAX_PAGES_PER_REQ 32
@@ -63,6 +64,9 @@ struct fuse_inode {
 
 	/** Time in jiffies until the file attributes are valid */
 	u64 i_time;
+
+	/** Semaphore for synchronizing writes with truncate */
+	struct rw_semaphore truncate_sem;
 };
 
 /** FUSE specific file data */
Index: linux/fs/fuse/inode.c
===================================================================
--- linux.orig/fs/fuse/inode.c	2007-02-27 14:41:10.000000000 +0100
+++ linux/fs/fuse/inode.c	2007-02-27 14:41:10.000000000 +0100
@@ -55,6 +55,7 @@ static struct inode *fuse_alloc_inode(st
 	fi->i_time = 0;
 	fi->nodeid = 0;
 	fi->nlookup = 0;
+	init_rwsem(&fi->truncate_sem);
 	fi->forget_req = fuse_request_alloc();
 	if (!fi->forget_req) {
 		kmem_cache_free(fuse_inode_cachep, inode);

--


* [patch 12/22] fuse: fix page invalidation
  2007-02-27 23:14 [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support Miklos Szeredi
                   ` (10 preceding siblings ...)
  2007-02-27 23:14 ` [patch 11/22] fuse: add truncation semaphore Miklos Szeredi
@ 2007-02-27 23:14 ` Miklos Szeredi
  2007-02-27 23:14 ` [patch 13/22] fuse: add list of writable files to fuse_inode Miklos Szeredi
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-27 23:14 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: fuse_invalidate_fix.patch --]
[-- Type: text/plain, Size: 3414 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Other than truncate, there are two cases when fuse tries to get rid
of cached pages:

 a) in open, if the KEEP_CACHE flag is not set
 b) in getattr, if the file size changed spontaneously

Until now invalidate_mapping_pages() was used, which didn't get rid
of mapped pages.  This is wrong, and becomes more wrong once dirty
pages are introduced.  So instead properly invalidate all pages with
invalidate_inode_pages2().  If the size of the file has shrunk,
truncate the pages outside the new size beforehand.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/fs/fuse/file.c
===================================================================
--- linux.orig/fs/fuse/file.c	2007-02-27 14:41:10.000000000 +0100
+++ linux/fs/fuse/file.c	2007-02-27 14:41:11.000000000 +0100
@@ -86,7 +86,7 @@ void fuse_finish_open(struct inode *inod
 	if (outarg->open_flags & FOPEN_DIRECT_IO)
 		file->f_op = &fuse_direct_io_file_operations;
 	if (!(outarg->open_flags & FOPEN_KEEP_CACHE))
-		invalidate_mapping_pages(inode->i_mapping, 0, -1);
+		invalidate_inode_pages2(inode->i_mapping);
 	ff->fh = outarg->fh;
 	file->private_data = fuse_file_get(ff);
 }
Index: linux/fs/fuse/inode.c
===================================================================
--- linux.orig/fs/fuse/inode.c	2007-02-27 14:41:10.000000000 +0100
+++ linux/fs/fuse/inode.c	2007-02-27 14:41:11.000000000 +0100
@@ -112,17 +112,22 @@ static int fuse_remount_fs(struct super_
 void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr)
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
-	if (S_ISREG(inode->i_mode) && i_size_read(inode) != attr->size)
-		invalidate_mapping_pages(inode->i_mapping, 0, -1);
+	loff_t oldsize = i_size_read(inode);
+
+	spin_lock(&fc->lock);
+	i_size_write(inode, attr->size);
+	spin_unlock(&fc->lock);
+	if (S_ISREG(inode->i_mode)) {
+		if (oldsize > attr->size)
+			fuse_vmtruncate(inode, attr->size);
+		invalidate_inode_pages2(inode->i_mapping);
+	}
 
 	inode->i_ino     = attr->ino;
 	inode->i_mode    = (inode->i_mode & S_IFMT) + (attr->mode & 07777);
 	inode->i_nlink   = attr->nlink;
 	inode->i_uid     = attr->uid;
 	inode->i_gid     = attr->gid;
-	spin_lock(&fc->lock);
-	i_size_write(inode, attr->size);
-	spin_unlock(&fc->lock);
 	inode->i_blocks  = attr->blocks;
 	inode->i_atime.tv_sec   = attr->atime;
 	inode->i_atime.tv_nsec  = attr->atimensec;
Index: linux/fs/fuse/dir.c
===================================================================
--- linux.orig/fs/fuse/dir.c	2007-02-27 14:41:10.000000000 +0100
+++ linux/fs/fuse/dir.c	2007-02-27 14:41:11.000000000 +0100
@@ -979,7 +979,7 @@ static void iattr_to_fattr(struct iattr 
 	}
 }
 
-static void fuse_vmtruncate(struct inode *inode, loff_t offset)
+void fuse_vmtruncate(struct inode *inode, loff_t offset)
 {
 	struct address_space *mapping = inode->i_mapping;
 	unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
Index: linux/fs/fuse/fuse_i.h
===================================================================
--- linux.orig/fs/fuse/fuse_i.h	2007-02-27 14:41:10.000000000 +0100
+++ linux/fs/fuse/fuse_i.h	2007-02-27 14:41:11.000000000 +0100
@@ -565,3 +565,8 @@ int fuse_ctl_add_conn(struct fuse_conn *
  * Remove connection from control filesystem
  */
 void fuse_ctl_remove_conn(struct fuse_conn *fc);
+
+/**
+ * Truncate inode
+ */
+void fuse_vmtruncate(struct inode *inode, loff_t offset);

--


* [patch 13/22] fuse: add list of writable files to fuse_inode
  2007-02-27 23:14 [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support Miklos Szeredi
                   ` (11 preceding siblings ...)
  2007-02-27 23:14 ` [patch 12/22] fuse: fix page invalidation Miklos Szeredi
@ 2007-02-27 23:14 ` Miklos Szeredi
  2007-02-27 23:14 ` [patch 14/22] fuse: add helper for asynchronous writes Miklos Szeredi
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-27 23:14 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: fuse_write_files.patch --]
[-- Type: text/plain, Size: 3078 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Each WRITE request must carry a valid file descriptor.  When a page is
written back from a memory mapping, the file through which the page
was dirtied is not available, so a new mechanism is needed to find a
suitable file in ->writepage(s).

A list of fuse_files is added to fuse_inode.  The file is removed from
the list in fuse_release().  The addition of the file to the list is
in a later patch.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/fs/fuse/file.c
===================================================================
--- linux.orig/fs/fuse/file.c	2007-02-27 14:41:11.000000000 +0100
+++ linux/fs/fuse/file.c	2007-02-27 14:41:11.000000000 +0100
@@ -53,6 +53,7 @@ struct fuse_file *fuse_file_alloc(void)
 			kfree(ff);
 			ff = NULL;
 		}
+		INIT_LIST_HEAD(&ff->write_entry);
 		atomic_set(&ff->count, 0);
 	}
 	return ff;
@@ -147,12 +148,18 @@ int fuse_release_common(struct inode *in
 {
 	struct fuse_file *ff = file->private_data;
 	if (ff) {
+		struct fuse_conn *fc = get_fuse_conn(inode);
+
 		fuse_release_fill(ff, get_node_id(inode), file->f_flags,
 				  isdir ? FUSE_RELEASEDIR : FUSE_RELEASE);
 
 		/* Hold vfsmount and dentry until release is finished */
 		ff->reserved_req->vfsmount = mntget(file->f_path.mnt);
 		ff->reserved_req->dentry = dget(file->f_path.dentry);
+
+		spin_lock(&fc->lock);
+		list_del(&ff->write_entry);
+		spin_unlock(&fc->lock);
 		/*
 		 * Normally this will send the RELEASE request,
 		 * however if some asynchronous READ or WRITE requests
Index: linux/fs/fuse/fuse_i.h
===================================================================
--- linux.orig/fs/fuse/fuse_i.h	2007-02-27 14:41:11.000000000 +0100
+++ linux/fs/fuse/fuse_i.h	2007-02-27 14:41:11.000000000 +0100
@@ -65,6 +65,9 @@ struct fuse_inode {
 	/** Time in jiffies until the file attributes are valid */
 	u64 i_time;
 
+	/** Files usable in writepage.  Protected by fc->lock */
+	struct list_head write_files;
+
 	/** Semaphore for synchronizing writes with truncate */
 	struct rw_semaphore truncate_sem;
 };
@@ -79,6 +82,9 @@ struct fuse_file {
 
 	/** Refcount */
 	atomic_t count;
+
+	/** Entry on inode's write_files list */
+	struct list_head write_entry;
 };
 
 /** One input argument of a request */
Index: linux/fs/fuse/inode.c
===================================================================
--- linux.orig/fs/fuse/inode.c	2007-02-27 14:41:11.000000000 +0100
+++ linux/fs/fuse/inode.c	2007-02-27 14:41:11.000000000 +0100
@@ -55,6 +55,7 @@ static struct inode *fuse_alloc_inode(st
 	fi->i_time = 0;
 	fi->nodeid = 0;
 	fi->nlookup = 0;
+	INIT_LIST_HEAD(&fi->write_files);
 	init_rwsem(&fi->truncate_sem);
 	fi->forget_req = fuse_request_alloc();
 	if (!fi->forget_req) {
@@ -68,6 +69,7 @@ static struct inode *fuse_alloc_inode(st
 static void fuse_destroy_inode(struct inode *inode)
 {
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	BUG_ON(!list_empty(&fi->write_files));
 	if (fi->forget_req)
 		fuse_request_free(fi->forget_req);
 	kmem_cache_free(fuse_inode_cachep, inode);

--


* [patch 14/22] fuse: add helper for asynchronous writes
  2007-02-27 23:14 [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support Miklos Szeredi
                   ` (12 preceding siblings ...)
  2007-02-27 23:14 ` [patch 13/22] fuse: add list of writable files to fuse_inode Miklos Szeredi
@ 2007-02-27 23:14 ` Miklos Szeredi
  2007-02-27 23:14 ` [patch 15/22] add non-owner variant of down_read_trylock() Miklos Szeredi
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-27 23:14 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: fuse_fill_write.patch --]
[-- Type: text/plain, Size: 3337 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

This patch adds a new helper function fuse_write_fill() which makes it
possible to send WRITE requests asynchronously.

A new flag for WRITE requests is also added, which indicates that this
is a write from the page cache, and not a "normal" file write.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/fs/fuse/file.c
===================================================================
--- linux.orig/fs/fuse/file.c	2007-02-27 14:41:11.000000000 +0100
+++ linux/fs/fuse/file.c	2007-02-27 14:41:11.000000000 +0100
@@ -440,30 +440,37 @@ out:
 	return err;
 }
 
-static size_t fuse_send_write(struct fuse_req *req, struct file *file,
-			      struct inode *inode, loff_t pos, size_t count)
-{
-	struct fuse_conn *fc = get_fuse_conn(inode);
-	struct fuse_file *ff = file->private_data;
-	struct fuse_write_in inarg;
-	struct fuse_write_out outarg;
-
-	memset(&inarg, 0, sizeof(struct fuse_write_in));
-	inarg.fh = ff->fh;
-	inarg.offset = pos;
-	inarg.size = count;
+static void fuse_write_fill(struct fuse_req *req, struct fuse_file *ff,
+			    struct inode *inode, loff_t pos, size_t count,
+			    int writepage)
+{
+	struct fuse_write_in *inarg = &req->misc.write.in;
+	struct fuse_write_out *outarg = &req->misc.write.out;
+
+	memset(inarg, 0, sizeof(struct fuse_write_in));
+	inarg->fh = ff->fh;
+	inarg->offset = pos;
+	inarg->size = count;
+	inarg->write_flags = writepage ? FUSE_WRITE_CACHE : 0;
 	req->in.h.opcode = FUSE_WRITE;
 	req->in.h.nodeid = get_node_id(inode);
 	req->in.argpages = 1;
 	req->in.numargs = 2;
 	req->in.args[0].size = sizeof(struct fuse_write_in);
-	req->in.args[0].value = &inarg;
+	req->in.args[0].value = inarg;
 	req->in.args[1].size = count;
 	req->out.numargs = 1;
 	req->out.args[0].size = sizeof(struct fuse_write_out);
-	req->out.args[0].value = &outarg;
+	req->out.args[0].value = outarg;
+}
+
+static size_t fuse_send_write(struct fuse_req *req, struct file *file,
+			      struct inode *inode, loff_t pos, size_t count)
+{
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	fuse_write_fill(req, file->private_data, inode, pos, count, 0);
 	request_send(fc, req);
-	return outarg.size;
+	return req->misc.write.out.size;
 }
 
 static int fuse_prepare_write(struct file *file, struct page *page,
Index: linux/fs/fuse/fuse_i.h
===================================================================
--- linux.orig/fs/fuse/fuse_i.h	2007-02-27 14:41:11.000000000 +0100
+++ linux/fs/fuse/fuse_i.h	2007-02-27 14:41:11.000000000 +0100
@@ -216,6 +216,10 @@ struct fuse_req {
 		struct fuse_init_in init_in;
 		struct fuse_init_out init_out;
 		struct fuse_read_in read_in;
+		struct {
+			struct fuse_write_in in;
+			struct fuse_write_out out;
+		} write;
 		struct fuse_lk_in lk_in;
 	} misc;
 
Index: linux/include/linux/fuse.h
===================================================================
--- linux.orig/include/linux/fuse.h	2007-02-27 14:40:54.000000000 +0100
+++ linux/include/linux/fuse.h	2007-02-27 14:41:11.000000000 +0100
@@ -97,6 +97,13 @@ struct fuse_file_lock {
  */
 #define FUSE_RELEASE_FLUSH	(1 << 0)
 
+/**
+ * WRITE flags
+ *
+ * FUSE_WRITE_CACHE: delayed write from page cache, file handle is guessed
+ */
+#define FUSE_WRITE_CACHE	(1 << 0)
+
 enum fuse_opcode {
 	FUSE_LOOKUP	   = 1,
 	FUSE_FORGET	   = 2,  /* no reply */

--


* [patch 15/22] add non-owner variant of down_read_trylock()
  2007-02-27 23:14 [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support Miklos Szeredi
                   ` (13 preceding siblings ...)
  2007-02-27 23:14 ` [patch 14/22] fuse: add helper for asynchronous writes Miklos Szeredi
@ 2007-02-27 23:14 ` Miklos Szeredi
  2007-02-27 23:14 ` [patch 16/22] fuse: add fuse_writepage() function Miklos Szeredi
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-27 23:14 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: down_read_trylock_non_owner.patch --]
[-- Type: text/plain, Size: 1536 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Needed by fuse writepage.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/include/linux/rwsem.h
===================================================================
--- linux.orig/include/linux/rwsem.h	2007-02-27 14:40:55.000000000 +0100
+++ linux/include/linux/rwsem.h	2007-02-27 14:41:09.000000000 +0100
@@ -82,11 +82,13 @@ extern void down_write_nested(struct rw_
  *   proper abstraction for this case is completions. ]
  */
 extern void down_read_non_owner(struct rw_semaphore *sem);
+extern int down_read_trylock_non_owner(struct rw_semaphore *sem);
 extern void up_read_non_owner(struct rw_semaphore *sem);
 #else
 # define down_read_nested(sem, subclass)		down_read(sem)
 # define down_write_nested(sem, subclass)	down_write(sem)
 # define down_read_non_owner(sem)		down_read(sem)
+# define down_read_trylock_non_owner(sem)	down_read_trylock(sem)
 # define up_read_non_owner(sem)			up_read(sem)
 #endif
 
Index: linux/kernel/rwsem.c
===================================================================
--- linux.orig/kernel/rwsem.c	2007-02-27 14:40:55.000000000 +0100
+++ linux/kernel/rwsem.c	2007-02-27 14:41:09.000000000 +0100
@@ -125,6 +125,13 @@ void down_read_non_owner(struct rw_semap
 
 EXPORT_SYMBOL(down_read_non_owner);
 
+int down_read_trylock_non_owner(struct rw_semaphore *sem)
+{
+	return __down_read_trylock(sem);
+}
+
+EXPORT_SYMBOL(down_read_trylock_non_owner);
+
 void down_write_nested(struct rw_semaphore *sem, int subclass)
 {
 	might_sleep();

--


* [patch 16/22] fuse: add fuse_writepage() function
  2007-02-27 23:14 [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support Miklos Szeredi
                   ` (14 preceding siblings ...)
  2007-02-27 23:14 ` [patch 15/22] add non-owner variant of down_read_trylock() Miklos Szeredi
@ 2007-02-27 23:14 ` Miklos Szeredi
  2007-02-27 23:14 ` [patch 17/22] fuse: writable shared mmap support Miklos Szeredi
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-27 23:14 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: fuse_writepage.patch --]
[-- Type: text/plain, Size: 4952 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Implement the ->writepage address space operation.  Be careful not to
block if the wbc->nonblocking flag is set.

Acquire the read-write truncation semaphore for read when allocating
the request.  Use the _non_owner variants, since the semaphore is held
until the asynchronous request is completed.

When allocating the request, use GFP_NOFS, since there's a danger of
recursion via direct page reclaim.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/fs/fuse/dev.c
===================================================================
--- linux.orig/fs/fuse/dev.c	2007-02-27 14:41:10.000000000 +0100
+++ linux/fs/fuse/dev.c	2007-02-27 14:41:12.000000000 +0100
@@ -41,7 +41,7 @@ static void fuse_request_init(struct fus
 
 struct fuse_req *fuse_request_alloc(void)
 {
-	struct fuse_req *req = kmem_cache_alloc(fuse_req_cachep, GFP_KERNEL);
+	struct fuse_req *req = kmem_cache_alloc(fuse_req_cachep, GFP_NOFS);
 	if (req)
 		fuse_request_init(req);
 	return req;
Index: linux/fs/fuse/file.c
===================================================================
--- linux.orig/fs/fuse/file.c	2007-02-27 14:41:11.000000000 +0100
+++ linux/fs/fuse/file.c	2007-02-27 14:41:12.000000000 +0100
@@ -11,6 +11,7 @@
 #include <linux/pagemap.h>
 #include <linux/slab.h>
 #include <linux/kernel.h>
+#include <linux/writeback.h>
 
 static const struct file_operations fuse_direct_io_file_operations;
 
@@ -644,6 +645,125 @@ static ssize_t fuse_direct_write(struct 
 	return res;
 }
 
+static void fuse_writepage_end(struct fuse_conn *fc, struct fuse_req *req)
+{
+	struct address_space *mapping = req->pages[0]->mapping;
+	struct fuse_inode *fi = get_fuse_inode(mapping->host);
+	int err = req->out.h.error;
+	if (!err && req->misc.write.out.size != req->misc.write.in.size)
+		err = -EIO;
+	mapping_set_error(mapping, err);
+	end_page_writeback(req->pages[0]);
+	up_read_non_owner(&fi->truncate_sem);
+	fuse_file_put(req->ff);
+	fuse_put_request(fc, req);
+}
+
+static struct fuse_req *fuse_get_req_wp(struct inode *inode,
+					struct writeback_control *wbc,
+					int *blocked)
+{
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_req *req;
+
+	*blocked = 0;
+	if (wbc->nonblocking && fc->blocked)
+		goto blocked;
+
+	atomic_inc(&fc->num_waiting);
+	if (!wbc->nonblocking)
+		wait_event(fc->blocked_waitq, !fc->blocked);
+
+	req = fuse_request_alloc();
+	if (!req) {
+		atomic_dec(&fc->num_waiting);
+		return NULL;
+	}
+	req->waiting = 1;
+
+	if (!wbc->nonblocking)
+		down_read_non_owner(&fi->truncate_sem);
+	else if (!down_read_trylock_non_owner(&fi->truncate_sem)) {
+		fuse_put_request(fc, req);
+		goto blocked;
+	}
+	spin_lock(&fc->lock);
+	BUG_ON(list_empty(&fi->write_files));
+	req->ff = fuse_file_get(list_entry(fi->write_files.next,
+					   struct fuse_file, write_entry));
+	spin_unlock(&fc->lock);
+	return req;
+
+ blocked:
+	*blocked = 1;
+	return NULL;
+}
+
+static int fuse_writepage_locked(struct page *page,
+				 struct writeback_control *wbc)
+{
+	struct inode *inode = page->mapping->host;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_req *req;
+	size_t count;
+	loff_t size;
+	unsigned long end_index;
+	int blocked;
+
+	req = fuse_get_req_wp(inode, wbc, &blocked);
+	if (!req) {
+		if (!blocked)
+			return -ENOMEM;
+		redirty_page_for_writepage(wbc, page);
+		return 0;
+	}
+
+	size = i_size_read(inode);
+	end_index = size >> PAGE_CACHE_SHIFT;
+
+	if (page->index < end_index)
+		count = PAGE_CACHE_SIZE;
+	else if (page->index == end_index)
+		count = size & (PAGE_CACHE_SIZE - 1);
+	else {
+		/*
+		 * It is possible that while waiting for the
+		 * truncation semaphore, the file was truncated.
+		 */
+		fuse_put_request(fc, req);
+		return 0;
+	}
+
+	req->num_pages = 1;
+	req->pages[0] = page;
+	req->page_offset = 0;
+	req->end = fuse_writepage_end;
+	fuse_write_fill(req, req->ff, inode, page_offset(page), count, 1);
+	set_page_writeback(page);
+	request_send_background(fc, req);
+	return 0;
+}
+
+static int fuse_writepage(struct page *page, struct writeback_control *wbc)
+{
+	int err = fuse_writepage_locked(page, wbc);
+	unlock_page(page);
+	return err;
+}
+
+static int fuse_launder_page(struct page *page)
+{
+	int err = 0;
+	if (clear_page_dirty_for_io(page)) {
+		struct writeback_control wbc = { .nonblocking = 0 };
+		err = fuse_writepage_locked(page, &wbc);
+		if (!err)
+			wait_on_page_writeback(page);
+	}
+	return err;
+}
+
 static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	if ((vma->vm_flags & VM_SHARED)) {
@@ -847,6 +967,8 @@ static const struct file_operations fuse
 
 static const struct address_space_operations fuse_file_aops  = {
 	.readpage	= fuse_readpage,
+	.writepage	= fuse_writepage,
+	.launder_page	= fuse_launder_page,
 	.prepare_write	= fuse_prepare_write,
 	.commit_write	= fuse_commit_write,
 	.readpages	= fuse_readpages,

--


* [patch 17/22] fuse: writable shared mmap support
  2007-02-27 23:14 [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support Miklos Szeredi
                   ` (15 preceding siblings ...)
  2007-02-27 23:14 ` [patch 16/22] fuse: add fuse_writepage() function Miklos Szeredi
@ 2007-02-27 23:14 ` Miklos Szeredi
  2007-02-27 23:15 ` [patch 18/22] fuse: add fuse_writepages() function Miklos Szeredi
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-27 23:14 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: fuse_writable_mmap.patch --]
[-- Type: text/plain, Size: 2757 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Change fuse_file_mmap() to allow shared writable mappings.  Change the
->set_page_dirty address space operation to __set_page_dirty_nobuffers.

In fuse_fsync() sync the inode's dirty data.

It is important that after all writable files are closed, no more
dirty pages remain for the inode.  So write back dirty pages from
munmap().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/fs/fuse/file.c
===================================================================
--- linux.orig/fs/fuse/file.c	2007-02-27 14:41:12.000000000 +0100
+++ linux/fs/fuse/file.c	2007-02-27 14:41:12.000000000 +0100
@@ -253,6 +253,11 @@ int fuse_fsync_common(struct file *file,
 	if (is_bad_inode(inode))
 		return -EIO;
 
+	/* Wait for all outstanding writes, before sending the FSYNC request */
+	err = write_inode_now(de->d_inode, 0);
+	if (err)
+		return err;
+
 	if ((!isdir && fc->no_fsync) || (isdir && fc->no_fsyncdir))
 		return 0;
 
@@ -764,21 +769,39 @@ static int fuse_launder_page(struct page
 	return err;
 }
 
-static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
+static void fuse_vma_close(struct vm_area_struct *vma)
 {
-	if ((vma->vm_flags & VM_SHARED)) {
-		if ((vma->vm_flags & VM_WRITE))
-			return -ENODEV;
-		else
-			vma->vm_flags &= ~VM_MAYWRITE;
-	}
-	return generic_file_mmap(file, vma);
+	/*
+	 * Write back dirty pages now, because there may not
+	 * be any suitable open files later
+	 */
+	filemap_write_and_wait(vma->vm_file->f_mapping);
 }
 
-static int fuse_set_page_dirty(struct page *page)
+static struct vm_operations_struct fuse_file_vm_ops = {
+	.close		= fuse_vma_close,
+	.nopage		= filemap_nopage,
+	.populate	= filemap_populate,
+};
+
+static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
-	printk("fuse_set_page_dirty: should not happen\n");
-	dump_stack();
+	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) {
+		struct inode *inode = file->f_dentry->d_inode;
+		struct fuse_conn *fc = get_fuse_conn(inode);
+		struct fuse_inode *fi = get_fuse_inode(inode);
+		struct fuse_file *ff = file->private_data;
+		/*
+		 * file may be written through mmap, so chain it onto the
+	 * inode's write_files list
+		 */
+		spin_lock(&fc->lock);
+		if (list_empty(&ff->write_entry))
+			list_add(&ff->write_entry, &fi->write_files);
+		spin_unlock(&fc->lock);
+	}
+	file_accessed(file);
+	vma->vm_ops = &fuse_file_vm_ops;
 	return 0;
 }
 
@@ -972,7 +995,7 @@ static const struct address_space_operat
 	.prepare_write	= fuse_prepare_write,
 	.commit_write	= fuse_commit_write,
 	.readpages	= fuse_readpages,
-	.set_page_dirty	= fuse_set_page_dirty,
+	.set_page_dirty	= __set_page_dirty_nobuffers,
 	.bmap		= fuse_bmap,
 };
 

--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [patch 18/22] fuse: add fuse_writepages() function
  2007-02-27 23:14 [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support Miklos Szeredi
                   ` (16 preceding siblings ...)
  2007-02-27 23:14 ` [patch 17/22] fuse: writable shared mmap support Miklos Szeredi
@ 2007-02-27 23:15 ` Miklos Szeredi
  2007-02-27 23:15 ` [patch 19/22] export sync_sb() to modules Miklos Szeredi
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-27 23:15 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: fuse_writepages.patch --]
[-- Type: text/plain, Size: 3986 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Implement the ->writepages address space operation.  This is very
similar to fuse_writepage(), but batches multiple pages into a single
request.

It reuses the fuse_fill_data structure currently used by
fuse_readpages().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/fs/fuse/file.c
===================================================================
--- linux.orig/fs/fuse/file.c	2007-02-27 14:41:12.000000000 +0100
+++ linux/fs/fuse/file.c	2007-02-27 14:41:12.000000000 +0100
@@ -650,15 +650,17 @@ static ssize_t fuse_direct_write(struct 
 	return res;
 }
 
-static void fuse_writepage_end(struct fuse_conn *fc, struct fuse_req *req)
+static void fuse_writepages_end(struct fuse_conn *fc, struct fuse_req *req)
 {
+	int i;
 	struct address_space *mapping = req->pages[0]->mapping;
 	struct fuse_inode *fi = get_fuse_inode(mapping->host);
 	int err = req->out.h.error;
 	if (!err && req->misc.write.out.size != req->misc.write.in.size)
 		err = -EIO;
 	mapping_set_error(mapping, err);
-	end_page_writeback(req->pages[0]);
+	for (i = 0; i < req->num_pages; i++)
+		end_page_writeback(req->pages[i]);
 	up_read_non_owner(&fi->truncate_sem);
 	fuse_file_put(req->ff);
 	fuse_put_request(fc, req);
@@ -743,7 +745,7 @@ static int fuse_writepage_locked(struct 
 	req->num_pages = 1;
 	req->pages[0] = page;
 	req->page_offset = 0;
-	req->end = fuse_writepage_end;
+	req->end = fuse_writepages_end;
 	fuse_write_fill(req, req->ff, inode, page_offset(page), count, 1);
 	set_page_writeback(page);
 	request_send_background(fc, req);
@@ -784,6 +786,80 @@ static struct vm_operations_struct fuse_
 	.populate	= filemap_populate,
 };
 
+static void fuse_send_writepages(struct fuse_req *req, struct inode *inode)
+{
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	loff_t pos = page_offset(req->pages[0]);
+	loff_t size = i_size_read(inode);
+
+	/* File truncated from under us? */
+	if (size <= pos)
+		fuse_writepages_end(fc, req);
+	else {
+		size_t count = req->num_pages << PAGE_CACHE_SHIFT;
+		if (pos + count > size)
+			count = size - pos;
+		fuse_write_fill(req, req->ff, inode, pos, count, 1);
+		req->end = fuse_writepages_end;
+		request_send_background(fc, req);
+	}
+}
+
+static int fuse_writepages_fill(struct page *page,
+				struct writeback_control *wbc, void *_data)
+{
+	struct fuse_fill_data *data = _data;
+	struct fuse_req *req = data->req;
+	struct inode *inode = data->inode;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	if (req &&
+	    (req->num_pages == FUSE_MAX_PAGES_PER_REQ ||
+	     (req->num_pages + 1) * PAGE_CACHE_SIZE > fc->max_write ||
+	     req->pages[req->num_pages - 1]->index + 1 != page->index)) {
+		fuse_send_writepages(req, inode);
+		req = NULL;
+	}
+	if (!req) {
+		int blocked;
+		data->req = req = fuse_get_req_wp(inode, wbc, &blocked);
+		if (!req) {
+			int err = 0;
+			if (blocked)
+				redirty_page_for_writepage(wbc, page);
+			else
+				err = -ENOMEM;
+			unlock_page(page);
+			return err;
+		}
+	}
+	req->pages[req->num_pages] = page;
+	req->num_pages++;
+	set_page_writeback(page);
+	unlock_page(page);
+	return 0;
+}
+
+static int fuse_writepages(struct address_space *mapping,
+			   struct writeback_control *wbc)
+{
+	struct inode *inode = mapping->host;
+	struct fuse_fill_data data;
+	int err;
+
+	if (is_bad_inode(inode))
+		return -EIO;
+
+	data.inode = inode;
+	data.req = NULL;
+
+	err = write_cache_pages(mapping, wbc, fuse_writepages_fill, &data);
+	if (data.req)
+		fuse_send_writepages(data.req, inode);
+
+	return err;
+}
+
 static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) {
@@ -995,6 +1071,7 @@ static const struct address_space_operat
 	.prepare_write	= fuse_prepare_write,
 	.commit_write	= fuse_commit_write,
 	.readpages	= fuse_readpages,
+	.writepages	= fuse_writepages,
 	.set_page_dirty	= __set_page_dirty_nobuffers,
 	.bmap		= fuse_bmap,
 };

--

* [patch 19/22] export sync_sb() to modules
  2007-02-27 23:14 [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support Miklos Szeredi
                   ` (17 preceding siblings ...)
  2007-02-27 23:15 ` [patch 18/22] fuse: add fuse_writepages() function Miklos Szeredi
@ 2007-02-27 23:15 ` Miklos Szeredi
  2007-02-27 23:15 ` [patch 20/22] fuse: make dirty stats available Miklos Szeredi
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-27 23:15 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: sync_sb.patch --]
[-- Type: text/plain, Size: 1376 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Create a function sync_sb() and export it to modules.  This is the
generic interface for writing back dirty data from a single
superblock.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/fs/fs-writeback.c
===================================================================
--- linux.orig/fs/fs-writeback.c	2007-02-27 14:40:54.000000000 +0100
+++ linux/fs/fs-writeback.c	2007-02-27 14:41:12.000000000 +0100
@@ -400,6 +400,14 @@ sync_sb_inodes(struct super_block *sb, s
 	return;		/* Leave any unwritten inodes on s_io */
 }
 
+void sync_sb(struct super_block *sb, struct writeback_control *wbc)
+{
+	spin_lock(&inode_lock);
+	sync_sb_inodes(sb, wbc);
+	spin_unlock(&inode_lock);
+}
+EXPORT_SYMBOL(sync_sb);
+
 /*
  * Start writeback of dirty pagecache data against all unlocked inodes.
  *
Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h	2007-02-27 14:41:09.000000000 +0100
+++ linux/include/linux/fs.h	2007-02-27 14:41:12.000000000 +0100
@@ -1240,6 +1240,7 @@ static inline void file_accessed(struct 
 }
 
 int sync_inode(struct inode *inode, struct writeback_control *wbc);
+void sync_sb(struct super_block *sb, struct writeback_control *wbc);
 
 /**
  * struct export_operations - for nfsd to communicate with file systems

--

* [patch 20/22] fuse: make dirty stats available
  2007-02-27 23:14 [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support Miklos Szeredi
                   ` (18 preceding siblings ...)
  2007-02-27 23:15 ` [patch 19/22] export sync_sb() to modules Miklos Szeredi
@ 2007-02-27 23:15 ` Miklos Szeredi
  2007-02-27 23:15 ` [patch 21/22] fuse: limit dirty pages Miklos Szeredi
  2007-02-27 23:15 ` [patch 22/22] fuse: allow big write requests Miklos Szeredi
  21 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-27 23:15 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: fuse_queue_stats.patch --]
[-- Type: text/plain, Size: 3840 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Make per-filesystem statistics about dirty and under-writeback pages
available through the fuse control filesystem.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/fs/fuse/control.c
===================================================================
--- linux.orig/fs/fuse/control.c	2007-02-27 14:40:54.000000000 +0100
+++ linux/fs/fuse/control.c	2007-02-27 14:41:13.000000000 +0100
@@ -41,24 +41,60 @@ static ssize_t fuse_conn_abort_write(str
 	return count;
 }
 
-static ssize_t fuse_conn_waiting_read(struct file *file, char __user *buf,
-				      size_t len, loff_t *ppos)
+enum fusectl_type {
+	FUSECTL_WAITING,
+	FUSECTL_NR_DIRTY,
+	FUSECTL_NR_WRITEBACK
+};
+
+static ssize_t fuse_conn_read(struct file *file, char __user *buf,
+			      size_t len, loff_t *ppos, enum fusectl_type sel)
 {
 	char tmp[32];
 	size_t size;
 
 	if (!*ppos) {
+		long val = 0;
 		struct fuse_conn *fc = fuse_ctl_file_conn_get(file);
 		if (!fc)
 			return 0;
 
-		file->private_data=(void *)(long)atomic_read(&fc->num_waiting);
+		switch (sel) {
+		case FUSECTL_WAITING:
+			val = atomic_read(&fc->num_waiting);
+			break;
+		case FUSECTL_NR_DIRTY:
+			val = atomic_long_read(&fc->bdi.nr_dirty);
+			break;
+		case FUSECTL_NR_WRITEBACK:
+			val = atomic_long_read(&fc->bdi.nr_writeback);
+			break;
+		}
+		file->private_data = (void *) val;
 		fuse_conn_put(fc);
 	}
 	size = sprintf(tmp, "%ld\n", (long)file->private_data);
 	return simple_read_from_buffer(buf, len, ppos, tmp, size);
 }
 
+static ssize_t fuse_conn_waiting_read(struct file *file, char __user *buf,
+				      size_t len, loff_t *ppos)
+{
+	return fuse_conn_read(file, buf, len, ppos, FUSECTL_WAITING);
+}
+
+static ssize_t fuse_conn_nr_dirty_read(struct file *file, char __user *buf,
+				       size_t len, loff_t *ppos)
+{
+	return fuse_conn_read(file, buf, len, ppos, FUSECTL_NR_DIRTY);
+}
+
+static ssize_t fuse_conn_nr_writeback_read(struct file *file, char __user *buf,
+					   size_t len, loff_t *ppos)
+{
+	return fuse_conn_read(file, buf, len, ppos, FUSECTL_NR_WRITEBACK);
+}
+
 static const struct file_operations fuse_ctl_abort_ops = {
 	.open = nonseekable_open,
 	.write = fuse_conn_abort_write,
@@ -69,6 +105,16 @@ static const struct file_operations fuse
 	.read = fuse_conn_waiting_read,
 };
 
+static const struct file_operations fuse_ctl_nr_dirty_ops = {
+	.open = nonseekable_open,
+	.read = fuse_conn_nr_dirty_read,
+};
+
+static const struct file_operations fuse_ctl_nr_writeback_ops = {
+	.open = nonseekable_open,
+	.read = fuse_conn_nr_writeback_read,
+};
+
 static struct dentry *fuse_ctl_add_dentry(struct dentry *parent,
 					  struct fuse_conn *fc,
 					  const char *name,
@@ -127,7 +173,11 @@ int fuse_ctl_add_conn(struct fuse_conn *
 	if (!fuse_ctl_add_dentry(parent, fc, "waiting", S_IFREG | 0400, 1,
 				NULL, &fuse_ctl_waiting_ops) ||
 	    !fuse_ctl_add_dentry(parent, fc, "abort", S_IFREG | 0200, 1,
-				 NULL, &fuse_ctl_abort_ops))
+				 NULL, &fuse_ctl_abort_ops) ||
+	    !fuse_ctl_add_dentry(parent, fc, "nr_dirty", S_IFREG | 0444, 1,
+				 NULL, &fuse_ctl_nr_dirty_ops) ||
+	    !fuse_ctl_add_dentry(parent, fc, "nr_writeback", S_IFREG | 0444, 1,
+				 NULL, &fuse_ctl_nr_writeback_ops))
 		goto err;
 
 	return 0;
Index: linux/fs/fuse/fuse_i.h
===================================================================
--- linux.orig/fs/fuse/fuse_i.h	2007-02-27 14:41:11.000000000 +0100
+++ linux/fs/fuse/fuse_i.h	2007-02-27 14:41:13.000000000 +0100
@@ -30,7 +30,7 @@
 #define FUSE_NAME_MAX 1024
 
 /** Number of dentries for each connection in the control filesystem */
-#define FUSE_CTL_NUM_DENTRIES 3
+#define FUSE_CTL_NUM_DENTRIES 5
 
 /** If the FUSE_DEFAULT_PERMISSIONS flag is given, the filesystem
     module will check permissions based on the file mode.  Otherwise no

--

* [patch 21/22] fuse: limit dirty pages
  2007-02-27 23:14 [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support Miklos Szeredi
                   ` (19 preceding siblings ...)
  2007-02-27 23:15 ` [patch 20/22] fuse: make dirty stats available Miklos Szeredi
@ 2007-02-27 23:15 ` Miklos Szeredi
  2007-02-27 23:15 ` [patch 22/22] fuse: allow big write requests Miklos Szeredi
  21 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-27 23:15 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: fuse_dirty_limit.patch --]
[-- Type: text/plain, Size: 2803 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Add a per-filesystem limit for the number of dirty pages.  If half the
limit is reached, background writeback is started.  If the limit is
reached, then start some writeback and wait until the number goes
below the limit again.

The dirty limit is currently hardcoded at 1MB.  This limiting is
necessary, so that buggy or unfriendly filesystems don't hurt system
performance too badly.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/fs/fuse/file.c
===================================================================
--- linux.orig/fs/fuse/file.c	2007-02-27 14:41:12.000000000 +0100
+++ linux/fs/fuse/file.c	2007-02-27 14:41:13.000000000 +0100
@@ -780,10 +780,35 @@ static void fuse_vma_close(struct vm_are
 	filemap_write_and_wait(vma->vm_file->f_mapping);
 }
 
+static int fuse_page_mkwrite(struct vm_area_struct *vma, struct page *page)
+{
+	struct inode *inode = page->mapping->host;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	long nr_dirty = atomic_long_read(&fc->bdi.nr_dirty);
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_NONE,
+		.nr_to_write = FUSE_MAX_PAGES_PER_REQ,
+		.range_cyclic = 1,
+	};
+
+	while (nr_dirty >= fc->max_dirty) {
+		sync_sb(inode->i_sb, &wbc);
+		congestion_wait(WRITE, HZ / 10);
+		nr_dirty = atomic_long_read(&fc->bdi.nr_dirty);
+		wbc.nr_to_write = FUSE_MAX_PAGES_PER_REQ;
+	}
+	if (nr_dirty >= fc->max_dirty / 2) {
+		wbc.nonblocking = 1;
+		sync_sb(inode->i_sb, &wbc);
+	}
+	return 0;
+}
+
 static struct vm_operations_struct fuse_file_vm_ops = {
 	.close		= fuse_vma_close,
 	.nopage		= filemap_nopage,
 	.populate	= filemap_populate,
+	.page_mkwrite	= fuse_page_mkwrite,
 };
 
 static void fuse_send_writepages(struct fuse_req *req, struct inode *inode)
Index: linux/fs/fuse/fuse_i.h
===================================================================
--- linux.orig/fs/fuse/fuse_i.h	2007-02-27 14:41:13.000000000 +0100
+++ linux/fs/fuse/fuse_i.h	2007-02-27 14:41:13.000000000 +0100
@@ -397,6 +397,9 @@ struct fuse_conn {
 
 	/** Reserved request for the DESTROY message */
 	struct fuse_req *destroy_req;
+
+	/** Maximum number of dirty pages in this fs */
+	unsigned long max_dirty;
 };
 
 static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
Index: linux/fs/fuse/inode.c
===================================================================
--- linux.orig/fs/fuse/inode.c	2007-02-27 14:41:11.000000000 +0100
+++ linux/fs/fuse/inode.c	2007-02-27 14:41:13.000000000 +0100
@@ -425,6 +425,7 @@ static struct fuse_conn *new_conn(void)
 		fc->bdi.unplug_io_fn = default_unplug_io_fn;
 		fc->reqctr = 0;
 		fc->blocked = 1;
+		fc->max_dirty = (1024 * 1024) / PAGE_CACHE_SIZE;
 		get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
 	}
 	return fc;

--

* [patch 22/22] fuse: allow big write requests
  2007-02-27 23:14 [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support Miklos Szeredi
                   ` (20 preceding siblings ...)
  2007-02-27 23:15 ` [patch 21/22] fuse: limit dirty pages Miklos Szeredi
@ 2007-02-27 23:15 ` Miklos Szeredi
  21 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-27 23:15 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: fuse_bigwrite.patch --]
[-- Type: text/plain, Size: 8132 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Up to now, file writes were split into page size WRITE requests.  This
is inefficient, since there are two context switches per request.

So allow bigger writes, but still do it synchronously.  Asynchronous
writeback would be even better, but is very difficult, because of the
way file attributes are handled in fuse.

Pages to be written back are collected in a per-file request.  When no
more pages fit in the request or the size of the write would exceed
the maximum write count, send the request and start a new one.  The
write, and hence the per-file request, is protected by i_mutex.

Pages are not dirtied, but changed to write-back state immediately.
If a page is already being written back, then wait for the writeback to
finish.

Change s_blocksize and s_blocksize_bits to the number of bytes that
fit in a single write request, so that userspace tools utilize the big
write requests.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/fs/fuse/file.c
===================================================================
--- linux.orig/fs/fuse/file.c	2007-02-27 20:45:07.000000000 +0100
+++ linux/fs/fuse/file.c	2007-02-27 21:58:06.000000000 +0100
@@ -56,6 +56,7 @@ struct fuse_file *fuse_file_alloc(void)
 		}
 		INIT_LIST_HEAD(&ff->write_entry);
 		atomic_set(&ff->count, 0);
+		ff->write_req = NULL;
 	}
 	return ff;
 }
@@ -91,6 +92,12 @@ void fuse_finish_open(struct inode *inod
 		invalidate_inode_pages2(inode->i_mapping);
 	ff->fh = outarg->fh;
 	file->private_data = fuse_file_get(ff);
+	/*
+	 * Writes are synchronous anyway, and the page syncing in
+	 * generic_file_aio_write_nolock() interferes with our special
+	 * batched write thingy
+	 */
+	file->f_flags &= ~O_SYNC;
 }
 
 int fuse_open_common(struct inode *inode, struct file *file, int isdir)
@@ -486,44 +493,135 @@ static int fuse_prepare_write(struct fil
 	return 0;
 }
 
+static int fuse_send_write_chunk(struct fuse_file *ff, struct inode *inode)
+{
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_req *req = ff->write_req;
+	struct fuse_write_in *inarg = &req->misc.write.in;
+	struct fuse_write_out *outarg = &req->misc.write.out;
+	loff_t pos = inarg->offset + inarg->size;
+	int err;
+	int i;
+
+	req->in.args[1].size = inarg->size;
+	request_send(fc, req);
+	err = req->out.h.error;
+	if (!err && outarg->size != inarg->size)
+		err = -EIO;
+	for (i = 0; i < req->num_pages; i++) {
+		struct page *page = req->pages[i];
+		if (err)
+			ClearPageUptodate(page);
+		end_page_writeback(page);
+	}
+	fuse_put_request(fc, req);
+	ff->write_req = NULL;
+	if (!err) {
+		spin_lock(&fc->lock);
+		if (pos > inode->i_size)
+			i_size_write(inode, pos);
+		spin_unlock(&fc->lock);
+	}
+	fuse_invalidate_attr(inode);
+	return err;
+}
+
+static int fuse_alloc_write_chunk(struct fuse_file *ff, struct inode *inode,
+				  struct page *page, unsigned offset)
+{
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_req *req = fuse_get_req(fc);
+	if (IS_ERR(req)) {
+		ClearPageUptodate(page);
+		return PTR_ERR(req);
+	}
+	fuse_write_fill(req, ff, inode, page_offset(page) + offset, 0, 0);
+	req->page_offset = offset;
+	ff->write_req = req;
+	return 0;
+}
+
+/*
+ * Fuse wants synchronous, interruptible writes, so this is slightly
+ * unconventional.
+ *
+ * Instead of just dirtying pages, they are accumulated in
+ * ff->write_req, in WriteBack state.  When the request is full or the
+ * number of bytes exceeds max_write, then the request is sent and a
+ * new one is started.
+ */
 static int fuse_commit_write(struct file *file, struct page *page,
 			     unsigned offset, unsigned to)
 {
 	int err;
-	size_t nres;
 	unsigned count = to - offset;
 	struct inode *inode = page->mapping->host;
 	struct fuse_conn *fc = get_fuse_conn(inode);
-	loff_t pos = page_offset(page) + offset;
-	struct fuse_req *req;
+	struct fuse_file *ff = file->private_data;
+	struct fuse_req *req = ff->write_req;
 
 	if (is_bad_inode(inode))
 		return -EIO;
 
-	req = fuse_get_req(fc);
-	if (IS_ERR(req))
-		return PTR_ERR(req);
+	BUG_ON(!count);
+	if (req) {
+		struct fuse_write_in *inarg = &req->misc.write.in;
+
+		/* Writes are always contiguous */
+		BUG_ON(inarg->offset+inarg->size != page_offset(page)+offset);
+
+		if ((!offset && req->num_pages == FUSE_MAX_PAGES_PER_REQ) ||
+		    inarg->size + count > fc->max_write) {
+			err = fuse_send_write_chunk(ff, inode);
+			if (err)
+				return err;
+		}
+	}
+	if (!ff->write_req) {
+		err = fuse_alloc_write_chunk(ff, inode, page, offset);
+		if (err)
+			return err;
+	}
+	req = ff->write_req;
 
-	req->num_pages = 1;
-	req->pages[0] = page;
-	req->page_offset = offset;
-	nres = fuse_send_write(req, file, inode, pos, count);
-	err = req->out.h.error;
-	fuse_put_request(fc, req);
-	if (!err && nres != count)
-		err = -EIO;
-	if (!err) {
-		pos += count;
-		spin_lock(&fc->lock);
-		if (pos > inode->i_size)
-			i_size_write(inode, pos);
-		spin_unlock(&fc->lock);
+	/* Whole page has been overwritten, it's now clean and up-to-date */
+	if (to == PAGE_CACHE_SIZE &&
+	    (!req->page_offset || req->num_pages > 1)) {
+		clear_page_dirty_for_io(page);
+		SetPageUptodate(page);
+	}
 
-		if (offset == 0 && to == PAGE_CACHE_SIZE)
-			SetPageUptodate(page);
+	if (!req->num_pages || req->pages[req->num_pages - 1] != page) {
+		req->pages[req->num_pages] = page;
+		req->num_pages++;
+		wait_on_page_writeback(page);
+		set_page_writeback(page);
 	}
-	fuse_invalidate_attr(inode);
-	return err;
+	req->misc.write.in.size += count;
+	return 0;
+}
+
+static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
+				   unsigned long nr_segs, loff_t pos)
+{
+	ssize_t ret;
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file->f_mapping->host;
+	struct fuse_file *ff = file->private_data;
+	BUG_ON(iocb->ki_pos != pos);
+
+	mutex_lock(&inode->i_mutex);
+	BUG_ON(ff->write_req);
+	ret = generic_file_aio_write_nolock(iocb, iov, nr_segs, iocb->ki_pos);
+	if (ff->write_req) {
+		ssize_t written = ff->write_req->misc.write.in.offset - pos;
+		int err = fuse_send_write_chunk(ff, inode);
+		if (err)
+			ret = written ? written : err;
+	}
+	mutex_unlock(&inode->i_mutex);
+
+	return ret;
 }
 
 static void fuse_release_user_pages(struct fuse_req *req, int write)
@@ -1067,7 +1165,7 @@ static const struct file_operations fuse
 	.read		= do_sync_read,
 	.aio_read	= generic_file_aio_read,
 	.write		= do_sync_write,
-	.aio_write	= generic_file_aio_write,
+	.aio_write	= fuse_file_aio_write,
 	.mmap		= fuse_file_mmap,
 	.open		= fuse_open,
 	.flush		= fuse_flush,
Index: linux/fs/fuse/fuse_i.h
===================================================================
--- linux.orig/fs/fuse/fuse_i.h	2007-02-27 20:45:07.000000000 +0100
+++ linux/fs/fuse/fuse_i.h	2007-02-27 20:46:51.000000000 +0100
@@ -85,6 +85,9 @@ struct fuse_file {
 
 	/** Entry on inode's write_files list */
 	struct list_head write_entry;
+
+	/** Request used in synchronous writes */
+	struct fuse_req *write_req;
 };
 
 /** One input argument of a request */
Index: linux/fs/fuse/inode.c
===================================================================
--- linux.orig/fs/fuse/inode.c	2007-02-27 20:45:07.000000000 +0100
+++ linux/fs/fuse/inode.c	2007-02-27 21:57:19.000000000 +0100
@@ -108,6 +108,9 @@ static int fuse_remount_fs(struct super_
 	if (*flags & MS_MANDLOCK)
 		return -EINVAL;
 
+	/* Writes are synchronous anyway, also see O_SYNC */
+	*flags &= ~MS_SYNCHRONOUS;
+
 	return 0;
 }
 
@@ -542,6 +545,9 @@ static int fuse_fill_super(struct super_
 	if (sb->s_flags & MS_MANDLOCK)
 		return -EINVAL;
 
+	/* Writes are synchronous anyway, also see O_SYNC */
+	sb->s_flags &= ~MS_SYNCHRONOUS;
+
 	if (!parse_fuse_opt((char *) data, &d, is_bdev))
 		return -EINVAL;
 
@@ -551,8 +557,8 @@ static int fuse_fill_super(struct super_
 			return -EINVAL;
 #endif
 	} else {
-		sb->s_blocksize = PAGE_CACHE_SIZE;
-		sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
+		sb->s_blocksize = PAGE_CACHE_SIZE * FUSE_MAX_PAGES_PER_REQ;
+		sb->s_blocksize_bits = ilog2(sb->s_blocksize);
 	}
 	sb->s_magic = FUSE_SUPER_MAGIC;
 	sb->s_op = &fuse_super_operations;

--

* Re: [patch 01/22] update ctime and mtime for mmaped write
  2007-02-27 23:14 ` [patch 01/22] update ctime and mtime for mmaped write Miklos Szeredi
@ 2007-02-28 14:16   ` Peter Staubach
  2007-02-28 17:06     ` Miklos Szeredi
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Staubach @ 2007-02-28 14:16 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-fsdevel

Miklos Szeredi wrote:
> From: Miklos Szeredi <mszeredi@suse.cz>
>
> Changes:
>  o moved check from __fput() to remove_vma(), which is more logical
>  o changed set_page_dirty() to set_page_dirty_mapping in hugetlb.c
>  o cleaned up #ifdef CONFIG_BLOCK mess
>
>
> This patch makes writing to shared memory mappings update st_ctime and
> st_mtime as defined by SUSv3:
>
>    The st_ctime and st_mtime fields of a file that is mapped with
>    MAP_SHARED and PROT_WRITE shall be marked for update at some point
>    in the interval between a write reference to the mapped region and
>    the next call to msync() with MS_ASYNC or MS_SYNC for that portion
>    of the file by any process. If there is no such call and if the
>    underlying file is modified as a result of a write reference, then
>    these fields shall be marked for update at some time after the
>    write reference.
>
> A new address_space flag is introduced: AS_CMTIME.  This is set each
> time a page is dirtied through a userspace memory mapping.  This
> includes write accesses via get_user_pages().
>
> Note, the flag is set unconditionally, even if the page is already
> dirty.  This is important, because the page might have been dirtied
> earlier by a non-mmap write.
>
> This flag is checked in msync() and munmap()/mremap(), and if set, the
> file times are updated and the flag is cleared
>
> The flag is also cleared, if the time update is triggered by a normal
> write.  This is not mandated by the standard, but seems to be a sane
> thing to do.
>
> Fixes Novell Bugzilla #206431.
>
> Inspired by Peter Staubach's patch and the resulting comments.
>
> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>

These changes still have the undesirable property that although the
modified pages may be flushed to stable storage, the metadata on
the file will not be updated until the application takes positive
action.  This is permissible given the current wording in the
specifications, but it would be much more desirable if sync(2),
fsync(P), or the inode being written out due to normal system
activity would also cause the metadata to be updated.

Perhaps the setting of the flag could be checked in some places
like __sync_single_inode() and do_fsync()?

    Thanx...

       ps

* Re: [patch 01/22] update ctime and mtime for mmaped write
  2007-02-28 14:16   ` Peter Staubach
@ 2007-02-28 17:06     ` Miklos Szeredi
  2007-02-28 17:21       ` Peter Staubach
  0 siblings, 1 reply; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-28 17:06 UTC (permalink / raw)
  To: staubach; +Cc: akpm, linux-kernel, linux-fsdevel

> These changes still have the undesirable property that although the
> modified pages may be flushed to stable storage, the metadata on
> the file will not be updated until the application takes positive
> action.  This is permissible given the current wording in the
> specifications, but it would be much more desirable if sync(2),
> fsync(P), or the inode being written out due to normal system
> activity would also cause the metadata to be updated.
> 
> Perhaps the setting of the flag could be checked in some places
> like __sync_single_inode() and do_fsync()?

I don't see the point in updating the timestamp from these functions.

The file isn't _modified_ by sync() or fsync().  Just as it's not
modified by stat().

sync() and fsync() do cache->disk, while the file itself stays the
same.

OTOH msync(MS_ASYNC) does memory->file, which is a conceptually file
modifying operation.  OK, msync(MS_ASYNC) is actually a no-op on
2.6.18+, but that's purely an implementation detail and no application
should be relying on it.

Before 2.6.18 sync() or fsync() actually didn't flush data written
through a shared mapping to disk, only msync(MS_SYNC), because the
dirty state was only available in the page tables, not in the page or
the inode.

Miklos

* Re: [patch 01/22] update ctime and mtime for mmaped write
  2007-02-28 17:06     ` Miklos Szeredi
@ 2007-02-28 17:21       ` Peter Staubach
  2007-02-28 17:51         ` Miklos Szeredi
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Staubach @ 2007-02-28 17:21 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-fsdevel

Miklos Szeredi wrote:
>> These changes still have the undesirable property that although the
>> modified pages may be flushed to stable storage, the metadata on
>> the file will not be updated until the application takes positive
>> action.  This is permissible given the current wording in the
>> specifications, but it would be much more desirable if sync(2),
>> fsync(P), or the inode being written out due to normal system
>> activity would also cause the metadata to be updated.
>>
>> Perhaps the setting of the flag could be checked in some places
>> like __sync_single_inode() and do_fsync()?
>>     
>
> I don't see the point in updating the timestamp from these functions.
>
> The file isn't _modified_ by sync() or fsync().  Just as it's not
> modified by stat().
>
> sync() and fsync() do cache->disk, while the file itself stays the
> same.
>
> OTOH msync(MS_ASYNC) does memory->file, which is a conceptually file
> modifying operation.  OK, msync(MS_ASYNC) is actually a no-op on
> 2.6.18+, but that's purely an implementation detail and no application
> should be relying on it.
>
> Before 2.6.18 sync() or fsync() actually didn't flush data written
> through a shared mapping to disk, only msync(MS_SYNC), because the
> dirty state was only available in the page tables, not in the page or
> the inode.

While these entry points do not actually modify the file itself,
as was pointed out, they are handy points at which the kernel gains
control and could actually notice that the contents of the file are
no longer the same as they were, i.e. modified.

From the operating system viewpoint, this is where the semantics of
modification to file contents via mmap differs from the semantics of
modification to file contents via write(2).

It is desirable for the file times to be updated as quickly as
possible after the actual modification has occurred.  However, this
can only happen when the kernel has a chance to gain control and
either look or notice that a page has been modified.

---

A better design for all of this would be to update the file times
and mark the inode as needing to be written out when a page fault
is taken for a page which either does not exist or needs to be made
writable and that page is part of an appropriate style mapping.

       ps

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch 01/22] update ctime and mtime for mmaped write
  2007-02-28 17:21       ` Peter Staubach
@ 2007-02-28 17:51         ` Miklos Szeredi
  2007-02-28 20:01           ` Peter Staubach
  0 siblings, 1 reply; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-28 17:51 UTC (permalink / raw)
  To: staubach; +Cc: akpm, linux-kernel, linux-fsdevel

> While these entry points do not actually modify the file itself,
> as was pointed out, they are handy points at which the kernel gains
> control and could actually notice that the contents of the file are
> no longer the same as they were, ie. modified.
> 
>  From the operating system viewpoint, this is where the semantics of
> modification to file contents via mmap differs from the semantics of
> modification to file contents via write(2).
> 
> It is desirable for the file times to be updated as quickly as
> possible after the actual modification has occurred.

I disagree.

You don't worry about the timestamp being updated _during_ a large
write() call, even though the file is constantly being modified.

You think of write() as something instantaneous, while you think of
writing to a shared mapping, then doing msync(), as something taking a
long time.  In actual fact both are basically equivalent operations,
the difference being that you can easily modify non-contiguous parts
of a file with mmap, while you can't do that with write.  The
disadvantage of mmap comes from the cost of setting up the page
tables and handling the faults.

Think of it this way:

  shared mmap write + msync(MS_ASYNC)  ==  write()
  msync(MS_ASYNC) + fsync()  ==  msync(MS_SYNC)

> A better design for all of this would be to update the file times
> and mark the inode as needing to be written out when a page fault
> is taken for a page which either does not exist or needs to be made
> writable and that page is part of an appropriate style mapping.

I think this would just be a waste of CPU.

Miklos


* Re: [patch 01/22] update ctime and mtime for mmaped write
  2007-02-28 17:51         ` Miklos Szeredi
@ 2007-02-28 20:01           ` Peter Staubach
  2007-02-28 20:35             ` Miklos Szeredi
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Staubach @ 2007-02-28 20:01 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-fsdevel

Miklos Szeredi wrote:
>> While these entry points do not actually modify the file itself,
>> as was pointed out, they are handy points at which the kernel gains
>> control and could actually notice that the contents of the file are
>> no longer the same as they were, ie. modified.
>>
>>  From the operating system viewpoint, this is where the semantics of
>> modification to file contents via mmap differs from the semantics of
>> modification to file contents via write(2).
>>
>> It is desirable for the file times to be updated as quickly as
>> possible after the actual modification has occurred.
>>     
>
> I disagree.
>
> You don't worry about the timestamp being updated _during_ a large
> write() call, even though the file is constantly being modified.
>
>   

No, but you do worry about the timestamps being updated after
every write() call, no matter how large or small.

> You think of write() as something instantaneous, while you think of
> writing to a shared mapping, then doing msync(), as something taking a
> long time.  In actual fact both are basically equivalent operations,
> the difference being that you can easily modify non-contiguous parts
> of a file with mmap, while you can't do that with write.  The
> disadvantage of mmap comes from the cost of setting up the page
> tables and handling the faults.
>
> Think of it this way:
>
>   shared mmap write + msync(MS_ASYNC)  ==  write()
>   msync(MS_ASYNC) + fsync()  ==  msync(MS_SYNC)
>
>   

I don't believe that this is a valid characterization because the
changes to the contents of the file, made through the mmap'd region,
are immediately visible to any and all other applications accessing
the file.  Since the contents of the file are changing, then so
should the timestamps to reflect this.

>> A better design for all of this would be to update the file times
>> and mark the inode as needing to be written out when a page fault
>> is taken for a page which either does not exist or needs to be made
>> writable and that page is part of an appropriate style mapping.
>>     
>
> I think this would just be a waste of CPU.

I think that we are going to have to agree to disagree because
I don't agree either with your characterizations of the desirable
semantics associated with shared mmap or that maintaining the
correctness in the system is a waste of CPU.

I view mmap as a way for an application to treat the contents of
a file as another segment in its address space.  This allows it to
manipulate the contents of a file without incurring the overhead
of the read and write system calls and the double buffering that
naturally occurs with those system calls.  I think that:

    char *p = mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    *p = 1;
    *(p + 4096) = 2;

should have the same effect as:

    char c = 1;
    pwrite(fd, &c, 1, 0);
    c = 2;
    pwrite(fd, &c, 1, 4096);

Clearly, the two can't be equivalent since the operating system
can only become involved at certain times in order to update the
timestamps.  That's why there are specifications about the
timestamps for things like msync.  They should be as close as
possible though.

However, since I seem to be the only one presenting a different
viewpoint, then I will agree to disagree and commit.  I will see
if I can sell your semantics to my customer and find out if that
will satisfy them.

    Thanx...

       ps


* Re: [patch 01/22] update ctime and mtime for mmaped write
  2007-02-28 20:01           ` Peter Staubach
@ 2007-02-28 20:35             ` Miklos Szeredi
  2007-02-28 20:58               ` Miklos Szeredi
  0 siblings, 1 reply; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-28 20:35 UTC (permalink / raw)
  To: staubach; +Cc: miklos, akpm, linux-kernel, linux-fsdevel

> >> While these entry points do not actually modify the file itself,
> >> as was pointed out, they are handy points at which the kernel gains
> >> control and could actually notice that the contents of the file are
> >> no longer the same as they were, ie. modified.
> >>
> >>  From the operating system viewpoint, this is where the semantics of
> >> modification to file contents via mmap differs from the semantics of
> >> modification to file contents via write(2).
> >>
> >> It is desirable for the file times to be updated as quickly as
> >> possible after the actual modification has occurred.
> >>     
> >
> > I disagree.
> >
> > You don't worry about the timestamp being updated _during_ a large
> > write() call, even though the file is constantly being modified.
> >
> >   
> 
> No, but you do worry about the timestamps being updated after
> every write() call, no matter how large or small.

Right.  All I'm saying is that just writing to a shared mapping
without calling msync() is similar to a write() which hasn't yet
finished.  In both cases, you have a modified file, without a modified
timestamp.

> > You think of write() as something instantaneous, while you think of
> > writing to a shared mapping, then doing msync(), as something taking a
> > long time.  In actual fact both are basically equivalent operations,
> > the difference being that you can easily modify non-contiguous parts
> > of a file with mmap, while you can't do that with write.  The
> > disadvantage of mmap comes from the cost of setting up the page
> > tables and handling the faults.
> >
> > Think of it this way:
> >
> >   shared mmap write + msync(MS_ASYNC)  ==  write()
> >   msync(MS_ASYNC) + fsync()  ==  msync(MS_SYNC)
> >
> >   
> 
> I don't believe that this is a valid characterization because the
> changes to the contents of the file, made through the mmap'd region,
> are immediately visible to any and all other applications accessing
> the file.  Since the contents of the file are changing, then so
> should the timestamps to reflect this.

Same case with a large write().  Nothing prevents you from reading a
file, while a huge write is taking place to it, and yet, the
modification time isn't updated.

> I think that we are going to have to agree to disagree because
> I don't agree either with your characterizations of the desirable
> semantics associated with shared mmap or that maintaining the
> correctness in the system is a waste of CPU.

I didn't quite say _that_ in so many words :).  I said that updating
the timestamp on a per-page or per-inode first-dirtying basis is a
waste of effort.  Why?

What happens if the application overwrites what it had written some
time later?  Nothing.  The page is already read-write, the pte dirty,
so even though the file was clearly modified, there's absolutely no
way in which this can be used to force an update to the timestamp.

Is there anything special about the _first_ modification?  I don't
think so.  From an external application's point of view it doesn't
matter one whit whether a modification came through write(), after a
page fault, or on an already-present read-write page.

So what exactly _are_ the semantics we are trying to achieve?

> I view mmap as a way for an application to treat the contents of a
> file as another segment in its address space.  This allows it to
> manipulate the contents of a file without incurring the overhead of
> the read and write system calls and the double buffering that
> naturally occurs with those system calls.  I think that:
> 
>     char *p = mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
>     *p = 1;
>     *(p + 4096) = 2;
> 
> should have the same effect as:
> 
>     char c = 1;
>     pwrite(fd, &c, 1, 0);
>     c = 2;
>     pwrite(fd, &c, 1, 4096);

Not necessarily.  This is the equivalent _portable_ call sequence:

     char *p = mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
     *p = 1;
     *(p + 4096) = 2;
     msync(p, 4097, MS_ASYNC);

Yes, on Linux the former would work too, but there's really no point
in allowing applications to be lax and not do it properly.  But we've
been over this.

Miklos


* Re: [patch 01/22] update ctime and mtime for mmaped write
  2007-02-28 20:35             ` Miklos Szeredi
@ 2007-02-28 20:58               ` Miklos Szeredi
  2007-02-28 21:09                 ` Peter Staubach
  0 siblings, 1 reply; 32+ messages in thread
From: Miklos Szeredi @ 2007-02-28 20:58 UTC (permalink / raw)
  To: staubach; +Cc: akpm, linux-kernel, linux-fsdevel

> What happens if the application overwrites what it had written some
> time later?  Nothing.  The page is already read-write, the pte dirty,
> so even though the file was clearly modified, there's absolutely no
> way in which this can be used to force an update to the timestamp.

Which, I realize now, actually means that the patch is wrong.  Msync
will have to write-protect the page table entries, so that later
dirtyings can have an effect on the timestamp.

Thanks,
Miklos


* Re: [patch 01/22] update ctime and mtime for mmaped write
  2007-02-28 20:58               ` Miklos Szeredi
@ 2007-02-28 21:09                 ` Peter Staubach
  2007-03-01  7:25                   ` Miklos Szeredi
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Staubach @ 2007-02-28 21:09 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-fsdevel

Miklos Szeredi wrote:
>> What happens if the application overwrites what it had written some
>> time later?  Nothing.  The page is already read-write, the pte dirty,
>> so even though the file was clearly modified, there's absolutely no
>> way in which this can be used to force an update to the timestamp.
>>     
>
> Which, I realize now, actually means that the patch is wrong.  Msync
> will have to write-protect the page table entries, so that later
> dirtyings can have an effect on the timestamp.

I thought that PeterZ's changes were to write-protect the page after
cleaning it so that future modifications could be detected and tracked
accordingly?  Does the right thing not happen already?

    Thanx...

       ps


* Re: [patch 01/22] update ctime and mtime for mmaped write
  2007-02-28 21:09                 ` Peter Staubach
@ 2007-03-01  7:25                   ` Miklos Szeredi
  0 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2007-03-01  7:25 UTC (permalink / raw)
  To: staubach; +Cc: miklos, akpm, linux-kernel, linux-fsdevel

> >> What happens if the application overwrites what it had written some
> >> time later?  Nothing.  The page is already read-write, the pte dirty,
> >> so even though the file was clearly modified, there's absolutely no
> >> way in which this can be used to force an update to the timestamp.
> >>     
> >
> > Which, I realize now, actually means, that the patch is wrong.  Msync
> > will have to write protect the page table entries, so that later
> > dirtyings may have an effect on the timestamp.
> 
> I thought that PeterZ's changes were to write-protect the page after
> cleaning it so that future modifications could be detected and tracked
> accordingly?  Does the right thing not happen already?

Yes, but MS_ASYNC does not clean the pages.

In fact a better solution may be to rely on the dirty bit in the page
tables, so that no more page faults are necessary.

Thanks,
Miklos


end of thread, other threads:[~2007-03-01  7:25 UTC | newest]

Thread overview: 32+ messages
2007-02-27 23:14 [patch 00/22] misc VFS/VM patches and fuse writable shared mapping support Miklos Szeredi
2007-02-27 23:14 ` [patch 01/22] update ctime and mtime for mmaped write Miklos Szeredi
2007-02-28 14:16   ` Peter Staubach
2007-02-28 17:06     ` Miklos Szeredi
2007-02-28 17:21       ` Peter Staubach
2007-02-28 17:51         ` Miklos Szeredi
2007-02-28 20:01           ` Peter Staubach
2007-02-28 20:35             ` Miklos Szeredi
2007-02-28 20:58               ` Miklos Szeredi
2007-02-28 21:09                 ` Peter Staubach
2007-03-01  7:25                   ` Miklos Szeredi
2007-02-27 23:14 ` [patch 02/22] fix quadratic behavior of shrink_dcache_parent() Miklos Szeredi
2007-02-27 23:14 ` [patch 03/22] fix deadlock in balance_dirty_pages Miklos Szeredi
2007-02-27 23:14 ` [patch 04/22] fix deadlock in throttle_vm_writeout Miklos Szeredi
2007-02-27 23:14 ` [patch 05/22] balance dirty pages from loop device Miklos Szeredi
2007-02-27 23:14 ` [patch 06/22] consolidate generic_writepages and mpage_writepages Miklos Szeredi
2007-02-27 23:14 ` [patch 07/22] add filesystem subtype support Miklos Szeredi
2007-02-27 23:14 ` [patch 08/22] fuse: update backing_dev_info congestion state Miklos Szeredi
2007-02-27 23:14 ` [patch 09/22] fuse: fix reserved request wake up Miklos Szeredi
2007-02-27 23:14 ` [patch 10/22] fuse: add reference counting to fuse_file Miklos Szeredi
2007-02-27 23:14 ` [patch 11/22] fuse: add truncation semaphore Miklos Szeredi
2007-02-27 23:14 ` [patch 12/22] fuse: fix page invalidation Miklos Szeredi
2007-02-27 23:14 ` [patch 13/22] fuse: add list of writable files to fuse_inode Miklos Szeredi
2007-02-27 23:14 ` [patch 14/22] fuse: add helper for asynchronous writes Miklos Szeredi
2007-02-27 23:14 ` [patch 15/22] add non-owner variant of down_read_trylock() Miklos Szeredi
2007-02-27 23:14 ` [patch 16/22] fuse: add fuse_writepage() function Miklos Szeredi
2007-02-27 23:14 ` [patch 17/22] fuse: writable shared mmap support Miklos Szeredi
2007-02-27 23:15 ` [patch 18/22] fuse: add fuse_writepages() function Miklos Szeredi
2007-02-27 23:15 ` [patch 19/22] export sync_sb() to modules Miklos Szeredi
2007-02-27 23:15 ` [patch 20/22] fuse: make dirty stats available Miklos Szeredi
2007-02-27 23:15 ` [patch 21/22] fuse: limit dirty pages Miklos Szeredi
2007-02-27 23:15 ` [patch 22/22] fuse: allow big write requests Miklos Szeredi
