* [patch 0/8] VFS/VM patches
From: Miklos Szeredi @ 2007-03-06 18:04 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

There's a new patch that fixes the race in clear_page_dirty_for_io()
and cleans up the page dirtying logic.  The mmap mtime/ctime update
patch has also seen substantial improvement.  The others are resends
and fixes.

I'm not resending the fuse writable mmap patches until the fate of the
deadlock fixes is decided.

Thanks
Miklos

--


* [patch 1/8] fix race in clear_page_dirty_for_io()
From: Miklos Szeredi @ 2007-03-06 18:04 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel, torvalds

[-- Attachment #1: split_set_page_dirty.patch --]
[-- Type: text/plain, Size: 39517 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Fix race in clear_page_dirty_for_io() as suggested by Linus.

Split the page dirtying into two phases:

  1) done for each dirtying attempt
  2) done only when the page's dirty flag changed from clean to dirty

For phase 1, a_ops->set_page_dirty() is used (couldn't find a better
name).  For phase 2, a_ops->page_dirtied() is introduced.

__set_page_dirty_buffers() is renamed to block_set_page_dirty() and
only does phase 1 (buffer head dirtying).

__set_page_dirty_nobuffers() only did phase 2; it becomes
generic_page_dirtied(), which does not set the dirty flag itself.

A new function __set_page_dirty() sets the dirty flag and, if the page
was clean before, calls a_ops->page_dirtied.

set_page_dirty() calls a_ops->set_page_dirty() and then does the
equivalent of __set_page_dirty().  The actual implementation is
slightly different, so that the mapping does not have to be looked up
repeatedly.

clear_page_dirty_for_io() can now be done race-free: first clear the
page dirty flag, then, if dirty page accounting is enabled,
write-protect the ptes.  If any of the ptes were dirty/writable, call
a_ops->set_page_dirty().

If a_ops->set_page_dirty is NULL, fall back to block_set_page_dirty().
If a_ops->page_dirtied is NULL, fall back to generic_page_dirtied().

Two new functions are introduced, both no-ops:

  simple_set_page_dirty()
  simple_page_dirtied()

These are for cases where the respective method doesn't need to do
anything.
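
Condensed from the mm/page-writeback.c hunks below, the new dirtying
sequence looks roughly like this (a sketch only; the NULL-mapping
fallback and the fastcall annotation are elided):

  int set_page_dirty(struct page *page)
  {
  	struct address_space *mapping = page_mapping(page);

  	/* Phase 1: runs on every dirtying attempt */
  	if (mapping->a_ops->set_page_dirty)
  		mapping->a_ops->set_page_dirty(page);
  	else
  		block_set_page_dirty(page);	/* CONFIG_BLOCK only */

  	/* Phase 2: only on a real clean -> dirty transition */
  	if (!TestSetPageDirty(page)) {
  		if (mapping->a_ops->page_dirtied)
  			mapping->a_ops->page_dirtied(page);
  		else
  			generic_page_dirtied(page);
  		return 1;
  	}
  	return 0;
  }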

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/arch/x86_64/kernel/functionlist
===================================================================
--- linux.orig/arch/x86_64/kernel/functionlist	2007-03-06 15:28:19.000000000 +0100
+++ linux/arch/x86_64/kernel/functionlist	2007-03-06 15:28:26.000000000 +0100
@@ -78,7 +78,6 @@
 *(.text.free_pages_and_swap_cache)
 *(.text.generic_fillattr)
 *(.text.__block_prepare_write)
-*(.text.__set_page_dirty_nobuffers)
 *(.text.link_path_walk)
 *(.text.find_get_pages_tag)
 *(.text.ide_do_request)
Index: linux/drivers/block/rd.c
===================================================================
--- linux.orig/drivers/block/rd.c	2007-03-06 15:28:19.000000000 +0100
+++ linux/drivers/block/rd.c	2007-03-06 15:28:26.000000000 +0100
@@ -178,23 +178,13 @@ static int ramdisk_writepages(struct add
 	return 0;
 }
 
-/*
- * ramdisk blockdev pages have their own ->set_page_dirty() because we don't
- * want them to contribute to dirty memory accounting.
- */
-static int ramdisk_set_page_dirty(struct page *page)
-{
-	if (!TestSetPageDirty(page))
-		return 1;
-	return 0;
-}
-
 static const struct address_space_operations ramdisk_aops = {
 	.readpage	= ramdisk_readpage,
 	.prepare_write	= ramdisk_prepare_write,
 	.commit_write	= ramdisk_commit_write,
 	.writepage	= ramdisk_writepage,
-	.set_page_dirty	= ramdisk_set_page_dirty,
+	.set_page_dirty	= simple_set_page_dirty,
+	.page_dirtied	= simple_page_dirtied,
 	.writepages	= ramdisk_writepages,
 };
 
Index: linux/fs/afs/file.c
===================================================================
--- linux.orig/fs/afs/file.c	2007-03-06 15:28:19.000000000 +0100
+++ linux/fs/afs/file.c	2007-03-06 15:28:26.000000000 +0100
@@ -35,7 +35,8 @@ const struct inode_operations afs_file_i
 
 const struct address_space_operations afs_fs_aops = {
 	.readpage	= afs_file_readpage,
-	.set_page_dirty	= __set_page_dirty_nobuffers,
+	.set_page_dirty	= simple_set_page_dirty,
+	.page_dirtied	= generic_page_dirtied,
 	.releasepage	= afs_file_releasepage,
 	.invalidatepage	= afs_file_invalidatepage,
 };
Index: linux/fs/buffer.c
===================================================================
--- linux.orig/fs/buffer.c	2007-03-06 15:28:19.000000000 +0100
+++ linux/fs/buffer.c	2007-03-06 15:28:26.000000000 +0100
@@ -683,7 +683,7 @@ void mark_buffer_dirty_inode(struct buff
 EXPORT_SYMBOL(mark_buffer_dirty_inode);
 
 /*
- * Add a page to the dirty page list.
+ * Set page's buffers dirty.
  *
  * It is a sad fact of life that this function is called from several places
  * deeply under spinlocking.  It may not sleep.
@@ -693,7 +693,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
  * not have buffers then when they are later attached they will all be set
  * dirty.
  *
- * The buffers are dirtied before the page is dirtied.  There's a small race
+ * This function is called before the page is dirtied.  There's a small race
  * window in which a writepage caller may see the page cleanness but not the
  * buffer dirtiness.  That's fine.  If this code were to set the page dirty
  * before the buffers, a concurrent writepage caller could clear the page dirty
@@ -707,12 +707,12 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
  * FIXME: may need to call ->reservepage here as well.  That's rather up to the
  * address_space though.
  */
-int __set_page_dirty_buffers(struct page *page)
+void block_set_page_dirty(struct page *page)
 {
-	struct address_space * const mapping = page_mapping(page);
+	struct address_space *mapping = page_mapping(page);
 
 	if (unlikely(!mapping))
-		return !TestSetPageDirty(page);
+		return;
 
 	spin_lock(&mapping->private_lock);
 	if (page_has_buffers(page)) {
@@ -725,24 +725,8 @@ int __set_page_dirty_buffers(struct page
 		} while (bh != head);
 	}
 	spin_unlock(&mapping->private_lock);
-
-	if (TestSetPageDirty(page))
-		return 0;
-
-	write_lock_irq(&mapping->tree_lock);
-	if (page->mapping) {	/* Race with truncate? */
-		if (mapping_cap_account_dirty(mapping)) {
-			__inc_zone_page_state(page, NR_FILE_DIRTY);
-			task_io_account_write(PAGE_CACHE_SIZE);
-		}
-		radix_tree_tag_set(&mapping->page_tree,
-				page_index(page), PAGECACHE_TAG_DIRTY);
-	}
-	write_unlock_irq(&mapping->tree_lock);
-	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-	return 1;
 }
-EXPORT_SYMBOL(__set_page_dirty_buffers);
+EXPORT_SYMBOL(block_set_page_dirty);
 
 /*
  * Write out and wait upon a list of buffers.
@@ -1144,7 +1128,7 @@ __getblk_slow(struct block_device *bdev,
 void fastcall mark_buffer_dirty(struct buffer_head *bh)
 {
 	if (!buffer_dirty(bh) && !test_set_buffer_dirty(bh))
-		__set_page_dirty_nobuffers(bh->b_page);
+		__set_page_dirty(bh->b_page);
 }
 
 /*
@@ -1497,7 +1481,7 @@ EXPORT_SYMBOL(block_invalidatepage);
 
 /*
  * We attach and possibly dirty the buffers atomically wrt
- * __set_page_dirty_buffers() via private_lock.  try_to_free_buffers
+ * block_set_page_dirty() via private_lock.  try_to_free_buffers
  * is already excluded via the page lock.
  */
 void create_empty_buffers(struct page *page,
@@ -1607,12 +1591,12 @@ static int __block_write_full_page(struc
 	}
 
 	/*
-	 * Be very careful.  We have no exclusion from __set_page_dirty_buffers
+	 * Be very careful.  We have no exclusion from block_set_page_dirty
 	 * here, and the (potentially unmapped) buffers may become dirty at
 	 * any time.  If a buffer becomes dirty here after we've inspected it
 	 * then we just miss that fact, and the page stays dirty.
 	 *
-	 * Buffers outside i_size may be dirtied by __set_page_dirty_buffers;
+	 * Buffers outside i_size may be dirtied by block_set_page_dirty;
 	 * handle that here by just cleaning them.
 	 */
 
@@ -2775,7 +2759,7 @@ int sync_dirty_buffer(struct buffer_head
  *
  * The same applies to regular filesystem pages: if all the buffers are
  * clean then we set the page clean and proceed.  To do that, we require
- * total exclusion from __set_page_dirty_buffers().  That is obtained with
+ * total exclusion from block_set_page_dirty().  That is obtained with
  * private_lock.
  *
  * try_to_free_buffers() is non-blocking.
@@ -2844,7 +2828,7 @@ int try_to_free_buffers(struct page *pag
 	 * the page also.
 	 *
 	 * private_lock must be held over this entire operation in order
-	 * to synchronise against __set_page_dirty_buffers and prevent the
+	 * to synchronise against block_set_page_dirty and prevent the
 	 * dirty bit from being lost.
 	 */
 	if (ret)
Index: linux/fs/cifs/file.c
===================================================================
--- linux.orig/fs/cifs/file.c	2007-03-06 15:28:19.000000000 +0100
+++ linux/fs/cifs/file.c	2007-03-06 15:28:26.000000000 +0100
@@ -2027,7 +2027,8 @@ const struct address_space_operations ci
 	.writepages = cifs_writepages,
 	.prepare_write = cifs_prepare_write,
 	.commit_write = cifs_commit_write,
-	.set_page_dirty = __set_page_dirty_nobuffers,
+	.set_page_dirty = simple_set_page_dirty,
+	.page_dirtied = generic_page_dirtied,
 	/* .sync_page = cifs_sync_page, */
 	/* .direct_IO = */
 };
@@ -2043,7 +2044,8 @@ const struct address_space_operations ci
 	.writepages = cifs_writepages,
 	.prepare_write = cifs_prepare_write,
 	.commit_write = cifs_commit_write,
-	.set_page_dirty = __set_page_dirty_nobuffers,
+	.set_page_dirty = simple_set_page_dirty,
+	.page_dirtied = generic_page_dirtied,
 	/* .sync_page = cifs_sync_page, */
 	/* .direct_IO = */
 };
Index: linux/fs/ext3/inode.c
===================================================================
--- linux.orig/fs/ext3/inode.c	2007-03-06 15:28:19.000000000 +0100
+++ linux/fs/ext3/inode.c	2007-03-06 15:28:26.000000000 +0100
@@ -1761,10 +1761,9 @@ out:
  * So what we do is to mark the page "pending dirty" and next time writepage
  * is called, propagate that into the buffers appropriately.
  */
-static int ext3_journalled_set_page_dirty(struct page *page)
+static void ext3_journalled_set_page_dirty(struct page *page)
 {
 	SetPageChecked(page);
-	return __set_page_dirty_nobuffers(page);
 }
 
 static const struct address_space_operations ext3_ordered_aops = {
@@ -1803,6 +1802,7 @@ static const struct address_space_operat
 	.prepare_write	= ext3_prepare_write,
 	.commit_write	= ext3_journalled_commit_write,
 	.set_page_dirty	= ext3_journalled_set_page_dirty,
+	.page_dirtied	= generic_page_dirtied,
 	.bmap		= ext3_bmap,
 	.invalidatepage	= ext3_invalidatepage,
 	.releasepage	= ext3_releasepage,
Index: linux/fs/ext4/inode.c
===================================================================
--- linux.orig/fs/ext4/inode.c	2007-03-06 15:28:19.000000000 +0100
+++ linux/fs/ext4/inode.c	2007-03-06 15:28:26.000000000 +0100
@@ -1760,10 +1760,9 @@ out:
  * So what we do is to mark the page "pending dirty" and next time writepage
  * is called, propagate that into the buffers appropriately.
  */
-static int ext4_journalled_set_page_dirty(struct page *page)
+static void ext4_journalled_set_page_dirty(struct page *page)
 {
 	SetPageChecked(page);
-	return __set_page_dirty_nobuffers(page);
 }
 
 static const struct address_space_operations ext4_ordered_aops = {
@@ -1802,6 +1801,7 @@ static const struct address_space_operat
 	.prepare_write	= ext4_prepare_write,
 	.commit_write	= ext4_journalled_commit_write,
 	.set_page_dirty	= ext4_journalled_set_page_dirty,
+	.page_dirtied	= generic_page_dirtied,
 	.bmap		= ext4_bmap,
 	.invalidatepage	= ext4_invalidatepage,
 	.releasepage	= ext4_releasepage,
Index: linux/fs/fuse/file.c
===================================================================
--- linux.orig/fs/fuse/file.c	2007-03-06 15:28:19.000000000 +0100
+++ linux/fs/fuse/file.c	2007-03-06 15:28:26.000000000 +0100
@@ -625,11 +625,10 @@ static int fuse_file_mmap(struct file *f
 	return generic_file_mmap(file, vma);
 }
 
-static int fuse_set_page_dirty(struct page *page)
+static void fuse_page_dirtied(struct page *page)
 {
-	printk("fuse_set_page_dirty: should not happen\n");
+	printk("fuse_page_dirtied: should not happen\n");
 	dump_stack();
-	return 0;
 }
 
 static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
@@ -820,7 +819,8 @@ static const struct address_space_operat
 	.prepare_write	= fuse_prepare_write,
 	.commit_write	= fuse_commit_write,
 	.readpages	= fuse_readpages,
-	.set_page_dirty	= fuse_set_page_dirty,
+	.set_page_dirty	= simple_set_page_dirty,
+	.page_dirtied	= fuse_page_dirtied,
 	.bmap		= fuse_bmap,
 };
 
Index: linux/fs/hostfs/hostfs_kern.c
===================================================================
--- linux.orig/fs/hostfs/hostfs_kern.c	2007-03-06 15:28:19.000000000 +0100
+++ linux/fs/hostfs/hostfs_kern.c	2007-03-06 15:28:26.000000000 +0100
@@ -518,7 +518,8 @@ int hostfs_commit_write(struct file *fil
 static const struct address_space_operations hostfs_aops = {
 	.writepage 	= hostfs_writepage,
 	.readpage	= hostfs_readpage,
-	.set_page_dirty = __set_page_dirty_nobuffers,
+	.set_page_dirty = simple_set_page_dirty,
+	.page_dirtied	= generic_page_dirtied,
 	.prepare_write	= hostfs_prepare_write,
 	.commit_write	= hostfs_commit_write
 };
Index: linux/fs/hugetlbfs/inode.c
===================================================================
--- linux.orig/fs/hugetlbfs/inode.c	2007-03-06 15:28:19.000000000 +0100
+++ linux/fs/hugetlbfs/inode.c	2007-03-06 15:28:26.000000000 +0100
@@ -451,12 +451,11 @@ static int hugetlbfs_symlink(struct inod
 /*
  * mark the head page dirty
  */
-static int hugetlbfs_set_page_dirty(struct page *page)
+static void hugetlbfs_set_page_dirty(struct page *page)
 {
 	struct page *head = (struct page *)page_private(page);
 
 	SetPageDirty(head);
-	return 0;
 }
 
 static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
@@ -545,6 +544,7 @@ static const struct address_space_operat
 	.prepare_write	= hugetlbfs_prepare_write,
 	.commit_write	= hugetlbfs_commit_write,
 	.set_page_dirty	= hugetlbfs_set_page_dirty,
+	.page_dirtied	= simple_page_dirtied,
 };
 
 
Index: linux/fs/jfs/jfs_metapage.c
===================================================================
--- linux.orig/fs/jfs/jfs_metapage.c	2007-03-06 15:28:19.000000000 +0100
+++ linux/fs/jfs/jfs_metapage.c	2007-03-06 15:28:26.000000000 +0100
@@ -583,7 +583,8 @@ const struct address_space_operations jf
 	.sync_page	= block_sync_page,
 	.releasepage	= metapage_releasepage,
 	.invalidatepage	= metapage_invalidatepage,
-	.set_page_dirty	= __set_page_dirty_nobuffers,
+	.set_page_dirty	= simple_set_page_dirty,
+	.page_dirtied	= generic_page_dirtied,
 };
 
 struct metapage *__get_metapage(struct inode *inode, unsigned long lblock,
Index: linux/fs/libfs.c
===================================================================
--- linux.orig/fs/libfs.c	2007-03-06 15:28:19.000000000 +0100
+++ linux/fs/libfs.c	2007-03-06 15:28:26.000000000 +0100
@@ -357,6 +357,21 @@ int simple_commit_write(struct file *fil
 	return 0;
 }
 
+/*
+ * For address_spaces which do not use buffers
+ */
+void simple_set_page_dirty(struct page *page)
+{
+}
+
+/*
+ * For address spaces which don't need the dirty radix tree tag, and
+ * the inode dirtiness
+ */
+void simple_page_dirtied(struct page *page)
+{
+}
+
 int simple_fill_super(struct super_block *s, int magic, struct tree_descr *files)
 {
 	static struct super_operations s_ops = {.statfs = simple_statfs};
@@ -615,6 +630,8 @@ EXPORT_SYMBOL(dcache_readdir);
 EXPORT_SYMBOL(generic_read_dir);
 EXPORT_SYMBOL(get_sb_pseudo);
 EXPORT_SYMBOL(simple_commit_write);
+EXPORT_SYMBOL(simple_set_page_dirty);
+EXPORT_SYMBOL(simple_page_dirtied);
 EXPORT_SYMBOL(simple_dir_inode_operations);
 EXPORT_SYMBOL(simple_dir_operations);
 EXPORT_SYMBOL(simple_empty);
Index: linux/fs/mpage.c
===================================================================
--- linux.orig/fs/mpage.c	2007-03-06 15:28:19.000000000 +0100
+++ linux/fs/mpage.c	2007-03-06 15:28:26.000000000 +0100
@@ -490,7 +490,7 @@ __mpage_writepage(struct bio *bio, struc
 			if (!buffer_mapped(bh)) {
 				/*
 				 * unmapped dirty buffers are created by
-				 * __set_page_dirty_buffers -> mmapped data
+				 * block_set_page_dirty -> mmapped data
 				 */
 				if (buffer_dirty(bh))
 					goto confused;
Index: linux/fs/nfs/write.c
===================================================================
--- linux.orig/fs/nfs/write.c	2007-03-06 15:28:19.000000000 +0100
+++ linux/fs/nfs/write.c	2007-03-06 15:28:26.000000000 +0100
@@ -406,7 +406,7 @@ static void
 nfs_redirty_request(struct nfs_page *req)
 {
 	clear_bit(PG_FLUSHING, &req->wb_flags);
-	__set_page_dirty_nobuffers(req->wb_page);
+	__set_page_dirty(req->wb_page);
 }
 
 /*
@@ -716,7 +716,7 @@ int nfs_updatepage(struct file *file, st
 	}
 
 	status = nfs_writepage_setup(ctx, page, offset, count);
-	__set_page_dirty_nobuffers(page);
+	__set_page_dirty(page);
 
         dprintk("NFS:      nfs_updatepage returns %d (isize %Ld)\n",
 			status, (long long)i_size_read(inode));
@@ -1481,7 +1481,7 @@ int nfs_wb_page(struct inode *inode, str
 	return nfs_wb_page_priority(inode, page, FLUSH_STABLE);
 }
 
-int nfs_set_page_dirty(struct page *page)
+void nfs_set_page_dirty(struct page *page)
 {
 	struct nfs_page *req;
 
@@ -1491,7 +1491,6 @@ int nfs_set_page_dirty(struct page *page
 		set_bit(PG_NEED_FLUSH, &req->wb_flags);
 		nfs_release_request(req);
 	}
-	return __set_page_dirty_nobuffers(page);
 }
 
 
Index: linux/fs/ntfs/aops.c
===================================================================
--- linux.orig/fs/ntfs/aops.c	2007-03-06 15:28:19.000000000 +0100
+++ linux/fs/ntfs/aops.c	2007-03-06 15:28:26.000000000 +0100
@@ -613,12 +613,12 @@ static int ntfs_write_block(struct page 
 	iblock = initialized_size >> blocksize_bits;
 
 	/*
-	 * Be very careful.  We have no exclusion from __set_page_dirty_buffers
+	 * Be very careful.  We have no exclusion from block_set_page_dirty
 	 * here, and the (potentially unmapped) buffers may become dirty at
 	 * any time.  If a buffer becomes dirty here after we've inspected it
 	 * then we just miss that fact, and the page stays dirty.
 	 *
-	 * Buffers outside i_size may be dirtied by __set_page_dirty_buffers;
+	 * Buffers outside i_size may be dirtied by block_set_page_dirty;
 	 * handle that here by just cleaning them.
 	 */
 
@@ -673,7 +673,7 @@ static int ntfs_write_block(struct page 
 				// Update initialized size in the attribute and
 				// in the inode.
 				// Again, for each page do:
-				//	__set_page_dirty_buffers();
+				//	block_set_page_dirty();
 				// page_cache_release()
 				// We don't need to wait on the writes.
 				// Update iblock.
@@ -1570,9 +1570,10 @@ const struct address_space_operations nt
 						   disk request queue. */
 #ifdef NTFS_RW
 	.writepage	= ntfs_writepage,	/* Write dirty page to disk. */
-	.set_page_dirty	= __set_page_dirty_nobuffers,	/* Set the page dirty
+	.set_page_dirty	= simple_set_page_dirty,/* Set the page dirty
 						   without touching the buffers
 						   belonging to the page. */
+	.page_dirtied	= generic_page_dirtied,
 #endif /* NTFS_RW */
 	.migratepage	= buffer_migrate_page,	/* Move a page cache page from
 						   one physical page to an
@@ -1634,7 +1635,7 @@ void mark_ntfs_record_dirty(struct page 
 		set_buffer_dirty(bh);
 	} while ((bh = bh->b_this_page) != head);
 	spin_unlock(&mapping->private_lock);
-	__set_page_dirty_nobuffers(page);
+	__set_page_dirty(page);
 	if (unlikely(buffers_to_free)) {
 		do {
 			bh = buffers_to_free->b_this_page;
Index: linux/fs/ntfs/file.c
===================================================================
--- linux.orig/fs/ntfs/file.c	2007-03-06 15:28:19.000000000 +0100
+++ linux/fs/ntfs/file.c	2007-03-06 15:28:26.000000000 +0100
@@ -1786,7 +1786,7 @@ err_out:
 			 * Put the page on mapping->dirty_pages, but leave its
 			 * buffers' dirty state as-is.
 			 */
-			__set_page_dirty_nobuffers(page);
+			__set_page_dirty(page);
 			err = 0;
 		} else
 			ntfs_error(vi->i_sb, "Page is not uptodate.  Written "
Index: linux/fs/ramfs/file-mmu.c
===================================================================
--- linux.orig/fs/ramfs/file-mmu.c	2007-03-06 15:28:19.000000000 +0100
+++ linux/fs/ramfs/file-mmu.c	2007-03-06 15:28:26.000000000 +0100
@@ -31,7 +31,8 @@ const struct address_space_operations ra
 	.readpage	= simple_readpage,
 	.prepare_write	= simple_prepare_write,
 	.commit_write	= simple_commit_write,
-	.set_page_dirty = __set_page_dirty_no_writeback,
+	.set_page_dirty	= simple_set_page_dirty,
+	.page_dirtied	= simple_page_dirtied,
 };
 
 const struct file_operations ramfs_file_operations = {
Index: linux/fs/ramfs/file-nommu.c
===================================================================
--- linux.orig/fs/ramfs/file-nommu.c	2007-03-06 15:28:19.000000000 +0100
+++ linux/fs/ramfs/file-nommu.c	2007-03-06 15:28:26.000000000 +0100
@@ -32,7 +32,8 @@ const struct address_space_operations ra
 	.readpage		= simple_readpage,
 	.prepare_write		= simple_prepare_write,
 	.commit_write		= simple_commit_write,
-	.set_page_dirty		= __set_page_dirty_no_writeback,
+	.set_page_dirty		= simple_set_page_dirty,
+	.page_dirtied		= simple_page_dirtied,
 };
 
 const struct file_operations ramfs_file_operations = {
Index: linux/fs/reiserfs/inode.c
===================================================================
--- linux.orig/fs/reiserfs/inode.c	2007-03-06 15:28:19.000000000 +0100
+++ linux/fs/reiserfs/inode.c	2007-03-06 15:28:26.000000000 +0100
@@ -2838,14 +2838,13 @@ static void reiserfs_invalidatepage(stru
 	return;
 }
 
-static int reiserfs_set_page_dirty(struct page *page)
+static void reiserfs_set_page_dirty(struct page *page)
 {
 	struct inode *inode = page->mapping->host;
-	if (reiserfs_file_data_log(inode)) {
+	if (reiserfs_file_data_log(inode))
 		SetPageChecked(page);
-		return __set_page_dirty_nobuffers(page);
-	}
-	return __set_page_dirty_buffers(page);
+	else
+		block_set_page_dirty(page);
 }
 
 /*
@@ -3012,4 +3011,5 @@ const struct address_space_operations re
 	.bmap = reiserfs_aop_bmap,
 	.direct_IO = reiserfs_direct_IO,
 	.set_page_dirty = reiserfs_set_page_dirty,
+	.page_dirtied = generic_page_dirtied,
 };
Index: linux/include/linux/buffer_head.h
===================================================================
--- linux.orig/include/linux/buffer_head.h	2007-03-06 15:28:19.000000000 +0100
+++ linux/include/linux/buffer_head.h	2007-03-06 15:28:26.000000000 +0100
@@ -308,7 +308,7 @@ static inline void lock_buffer(struct bu
 		__lock_buffer(bh);
 }
 
-extern int __set_page_dirty_buffers(struct page *page);
+extern void block_set_page_dirty(struct page *page);
 
 #else /* CONFIG_BLOCK */
 
Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h	2007-03-06 15:28:19.000000000 +0100
+++ linux/include/linux/fs.h	2007-03-06 15:36:44.000000000 +0100
@@ -404,8 +404,11 @@ struct address_space_operations {
 	/* Write back some dirty pages from this mapping. */
 	int (*writepages)(struct address_space *, struct writeback_control *);
 
-	/* Set a page dirty.  Return true if this dirtied it */
-	int (*set_page_dirty)(struct page *page);
+	/* Called for each page dirtying attempt, before dirty flag is set */
+	void (*set_page_dirty)(struct page *page);
+
+	/* Called if the page state has changed from clean to dirty */
+	void (*page_dirtied)(struct page *page);
 
 	int (*readpages)(struct file *filp, struct address_space *mapping,
 			struct list_head *pages, unsigned nr_pages);
@@ -1869,6 +1872,9 @@ extern int simple_prepare_write(struct f
 			unsigned offset, unsigned to);
 extern int simple_commit_write(struct file *file, struct page *page,
 				unsigned offset, unsigned to);
+extern void simple_set_page_dirty(struct page *page);
+extern void simple_page_dirtied(struct page *page);
+extern void generic_page_dirtied(struct page *page);
 
 extern struct dentry *simple_lookup(struct inode *, struct dentry *, struct nameidata *);
 extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t *);
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h	2007-03-06 15:28:19.000000000 +0100
+++ linux/include/linux/mm.h	2007-03-06 15:29:22.000000000 +0100
@@ -785,8 +785,7 @@ void print_bad_pte(struct vm_area_struct
 extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
 extern void do_invalidatepage(struct page *page, unsigned long offset);
 
-int __set_page_dirty_nobuffers(struct page *page);
-int __set_page_dirty_no_writeback(struct page *page);
+int __set_page_dirty(struct page *page);
 int redirty_page_for_writepage(struct writeback_control *wbc,
 				struct page *page);
 int FASTCALL(set_page_dirty(struct page *page));
Index: linux/include/linux/nfs_fs.h
===================================================================
--- linux.orig/include/linux/nfs_fs.h	2007-03-06 15:28:19.000000000 +0100
+++ linux/include/linux/nfs_fs.h	2007-03-06 15:28:26.000000000 +0100
@@ -421,7 +421,7 @@ extern int  nfs_flush_incompatible(struc
 extern int  nfs_updatepage(struct file *, struct page *, unsigned int, unsigned int);
 extern int nfs_writeback_done(struct rpc_task *, struct nfs_write_data *);
 extern void nfs_writedata_release(void *);
-extern int nfs_set_page_dirty(struct page *);
+extern void nfs_set_page_dirty(struct page *);
 
 /*
  * Try to write back everything synchronously (but check the
Index: linux/mm/filemap.c
===================================================================
--- linux.orig/mm/filemap.c	2007-03-06 15:28:19.000000000 +0100
+++ linux/mm/filemap.c	2007-03-06 15:28:26.000000000 +0100
@@ -60,7 +60,7 @@ generic_file_direct_IO(int rw, struct ki
  * Lock ordering:
  *
  *  ->i_mmap_lock		(vmtruncate)
- *    ->private_lock		(__free_pte->__set_page_dirty_buffers)
+ *    ->private_lock		(__free_pte->block_set_page_dirty)
  *      ->swap_lock		(exclusive_swap_page, others)
  *        ->mapping->tree_lock
  *
@@ -101,7 +101,7 @@ generic_file_direct_IO(int rw, struct ki
  *    ->tree_lock		(page_remove_rmap->set_page_dirty)
  *    ->inode_lock		(page_remove_rmap->set_page_dirty)
  *    ->inode_lock		(zap_pte_range->set_page_dirty)
- *    ->private_lock		(zap_pte_range->__set_page_dirty_buffers)
+ *    ->private_lock		(zap_pte_range->block_set_page_dirty)
  *
  *  ->task->proc_lock
  *    ->dcache_lock		(proc_pid_lookup)
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c	2007-03-06 15:28:19.000000000 +0100
+++ linux/mm/page-writeback.c	2007-03-06 15:29:22.000000000 +0100
@@ -734,22 +734,8 @@ int write_one_page(struct page *page, in
 EXPORT_SYMBOL(write_one_page);
 
 /*
- * For address_spaces which do not use buffers nor write back.
- */
-int __set_page_dirty_no_writeback(struct page *page)
-{
-	if (!PageDirty(page))
-		SetPageDirty(page);
-	return 0;
-}
-
-/*
- * For address_spaces which do not use buffers.  Just tag the page as dirty in
- * its radix tree.
- *
- * This is also used when a single buffer is being dirtied: we want to set the
- * page dirty in that case, but not all the buffers.  This is a "bottom-up"
- * dirtying, whereas __set_page_dirty_buffers() is a "top-down" dirtying.
+ * Tag page dirty in the radix tree and perform dirty page accounting
+ * if required.  Finally, mark the inode dirty.
  *
  * Most callers have locked the page, which pins the address_space in memory.
  * But zap_pte_range() does not lock the page, however in that case the
@@ -758,36 +744,49 @@ int __set_page_dirty_no_writeback(struct
  * We take care to handle the case where the page was truncated from the
  * mapping by re-checking page_mapping() inside tree_lock.
  */
-int __set_page_dirty_nobuffers(struct page *page)
+void generic_page_dirtied(struct page *page)
 {
-	if (!TestSetPageDirty(page)) {
-		struct address_space *mapping = page_mapping(page);
-		struct address_space *mapping2;
-
-		if (!mapping)
-			return 1;
+	struct address_space *mapping = page_mapping(page);
+
+	if (unlikely(!mapping))
+		return;
 
-		write_lock_irq(&mapping->tree_lock);
-		mapping2 = page_mapping(page);
-		if (mapping2) { /* Race with truncate? */
-			BUG_ON(mapping2 != mapping);
-			if (mapping_cap_account_dirty(mapping)) {
-				__inc_zone_page_state(page, NR_FILE_DIRTY);
-				task_io_account_write(PAGE_CACHE_SIZE);
-			}
-			radix_tree_tag_set(&mapping->page_tree,
-				page_index(page), PAGECACHE_TAG_DIRTY);
-		}
-		write_unlock_irq(&mapping->tree_lock);
-		if (mapping->host) {
-			/* !PageAnon && !swapper_space */
-			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+	write_lock_irq(&mapping->tree_lock);
+	if (page_mapping(page)) {	/* Race with truncate? */
+		if (mapping_cap_account_dirty(mapping)) {
+			__inc_zone_page_state(page, NR_FILE_DIRTY);
+			task_io_account_write(PAGE_CACHE_SIZE);
 		}
+		radix_tree_tag_set(&mapping->page_tree,
+				   page_index(page), PAGECACHE_TAG_DIRTY);
+	}
+	write_unlock_irq(&mapping->tree_lock);
+	if (mapping->host) {
+		/* !swapper_space */
+		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+	}
+}
+EXPORT_SYMBOL(generic_page_dirtied);
+
+/* The page was dirtied just now */
+static void page_dirtied(struct address_space *mapping, struct page *page)
+{
+	if (mapping->a_ops->page_dirtied)
+		mapping->a_ops->page_dirtied(page);
+	else
+		generic_page_dirtied(page);
+}
+
+/* Set page dirty without calling a_ops->set_page_dirty */
+int __set_page_dirty(struct page *page)
+{
+	if (!TestSetPageDirty(page)) {
+		struct address_space *mapping = page_mapping(page);
+		if (mapping)
+			page_dirtied(mapping, page);
 		return 1;
 	}
 	return 0;
 }
-EXPORT_SYMBOL(__set_page_dirty_nobuffers);
+EXPORT_SYMBOL(__set_page_dirty);
 
 /*
  * When a writepage implementation decides that it doesn't want to write this
@@ -797,25 +796,36 @@ EXPORT_SYMBOL(__set_page_dirty_nobuffers
 int redirty_page_for_writepage(struct writeback_control *wbc, struct page *page)
 {
 	wbc->pages_skipped++;
-	return __set_page_dirty_nobuffers(page);
+	return __set_page_dirty(page);
 }
 EXPORT_SYMBOL(redirty_page_for_writepage);
 
 /*
- * If the mapping doesn't provide a set_page_dirty a_op, then
- * just fall through and assume that it wants buffer_heads.
+ * If the mapping doesn't provide a set_page_dirty a_op, then assume
+ * that it wants buffer_heads.
  */
+static void set_dirty_every_time(struct address_space *mapping,
+				 struct page *page)
+{
+	if (mapping->a_ops->set_page_dirty)
+		mapping->a_ops->set_page_dirty(page);
+#ifdef CONFIG_BLOCK
+	else
+		block_set_page_dirty(page);
+#endif
+}
+
 int fastcall set_page_dirty(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 
 	if (likely(mapping)) {
-		int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
-#ifdef CONFIG_BLOCK
-		if (!spd)
-			spd = __set_page_dirty_buffers;
-#endif
-		return (*spd)(page);
+		set_dirty_every_time(mapping, page);
+		if (!TestSetPageDirty(page)) {
+			page_dirtied(mapping, page);
+			return 1;
+		}
+		return 0;
 	}
 	if (!PageDirty(page)) {
 		if (!TestSetPageDirty(page))
@@ -862,48 +872,18 @@ EXPORT_SYMBOL(set_page_dirty_lock);
  */
 int clear_page_dirty_for_io(struct page *page)
 {
-	struct address_space *mapping = page_mapping(page);
+	if (TestClearPageDirty(page)) {
+		struct address_space *mapping = page_mapping(page);
+
+		if (mapping && mapping_cap_account_dirty(mapping)) {
+			if (page_mkclean(page))
+				set_dirty_every_time(mapping, page);
 
-	if (mapping && mapping_cap_account_dirty(mapping)) {
-		/*
-		 * Yes, Virginia, this is indeed insane.
-		 *
-		 * We use this sequence to make sure that
-		 *  (a) we account for dirty stats properly
-		 *  (b) we tell the low-level filesystem to
-		 *      mark the whole page dirty if it was
-		 *      dirty in a pagetable. Only to then
-		 *  (c) clean the page again and return 1 to
-		 *      cause the writeback.
-		 *
-		 * This way we avoid all nasty races with the
-		 * dirty bit in multiple places and clearing
-		 * them concurrently from different threads.
-		 *
-		 * Note! Normally the "set_page_dirty(page)"
-		 * has no effect on the actual dirty bit - since
-		 * that will already usually be set. But we
-		 * need the side effects, and it can help us
-		 * avoid races.
-		 *
-		 * We basically use the page "master dirty bit"
-		 * as a serialization point for all the different
-		 * threads doing their things.
-		 *
-		 * FIXME! We still have a race here: if somebody
-		 * adds the page back to the page tables in
-		 * between the "page_mkclean()" and the "TestClearPageDirty()",
-		 * we might have it mapped without the dirty bit set.
-		 */
-		if (page_mkclean(page))
-			set_page_dirty(page);
-		if (TestClearPageDirty(page)) {
 			dec_zone_page_state(page, NR_FILE_DIRTY);
-			return 1;
 		}
-		return 0;
+		return 1;
 	}
-	return TestClearPageDirty(page);
+	return 0;
 }
 EXPORT_SYMBOL(clear_page_dirty_for_io);
 
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c	2007-03-06 15:28:19.000000000 +0100
+++ linux/mm/rmap.c	2007-03-06 15:29:22.000000000 +0100
@@ -30,7 +30,7 @@
  *             zone->lru_lock (in mark_page_accessed, isolate_lru_page)
  *             swap_lock (in swap_duplicate, swap_info_get)
  *               mmlist_lock (in mmput, drain_mmlist and others)
- *               mapping->private_lock (in __set_page_dirty_buffers)
+ *               mapping->private_lock (in block_set_page_dirty)
  *               inode_lock (in set_page_dirty's __mark_inode_dirty)
  *                 sb_lock (within inode_lock in fs/fs-writeback.c)
  *                 mapping->tree_lock (widely used, in set_page_dirty,
Index: linux/mm/shmem.c
===================================================================
--- linux.orig/mm/shmem.c	2007-03-06 15:28:19.000000000 +0100
+++ linux/mm/shmem.c	2007-03-06 15:28:26.000000000 +0100
@@ -2316,7 +2316,8 @@ static void destroy_inodecache(void)
 
 static const struct address_space_operations shmem_aops = {
 	.writepage	= shmem_writepage,
-	.set_page_dirty	= __set_page_dirty_no_writeback,
+	.set_page_dirty	= simple_set_page_dirty,
+	.page_dirtied	= simple_page_dirtied,
 #ifdef CONFIG_TMPFS
 	.prepare_write	= shmem_prepare_write,
 	.commit_write	= simple_commit_write,
Index: linux/mm/swap_state.c
===================================================================
--- linux.orig/mm/swap_state.c	2007-03-06 15:28:19.000000000 +0100
+++ linux/mm/swap_state.c	2007-03-06 15:28:26.000000000 +0100
@@ -27,7 +27,8 @@
 static const struct address_space_operations swap_aops = {
 	.writepage	= swap_writepage,
 	.sync_page	= block_sync_page,
-	.set_page_dirty	= __set_page_dirty_nobuffers,
+	.set_page_dirty	= simple_set_page_dirty,
+	.page_dirtied	= generic_page_dirtied,
 	.migratepage	= migrate_page,
 };
 
Index: linux/fs/nfs/file.c
===================================================================
--- linux.orig/fs/nfs/file.c	2007-03-06 15:28:19.000000000 +0100
+++ linux/fs/nfs/file.c	2007-03-06 15:28:26.000000000 +0100
@@ -328,6 +328,7 @@ const struct address_space_operations nf
 	.readpage = nfs_readpage,
 	.readpages = nfs_readpages,
 	.set_page_dirty = nfs_set_page_dirty,
+	.page_dirtied = generic_page_dirtied,
 	.writepage = nfs_writepage,
 	.writepages = nfs_writepages,
 	.prepare_write = nfs_prepare_write,
Index: linux/Documentation/filesystems/Locking
===================================================================
--- linux.orig/Documentation/filesystems/Locking	2007-03-06 15:29:38.000000000 +0100
+++ linux/Documentation/filesystems/Locking	2007-03-06 15:36:20.000000000 +0100
@@ -161,7 +161,8 @@ prototypes:
 	int (*readpage)(struct file *, struct page *);
 	int (*sync_page)(struct page *);
 	int (*writepages)(struct address_space *, struct writeback_control *);
-	int (*set_page_dirty)(struct page *page);
+	void (*set_page_dirty)(struct page *page);
+	void (*page_dirtied)(struct page *page);
 	int (*readpages)(struct file *filp, struct address_space *mapping,
 			struct list_head *pages, unsigned nr_pages);
 	int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
@@ -174,7 +175,7 @@ prototypes:
 	int (*launder_page) (struct page *);
 
 locking rules:
-	All except set_page_dirty may block
+	All except set_page_dirty and page_dirtied may block
 
 			BKL	PageLocked(page)
 writepage:		no	yes, unlocks (see below)
@@ -182,6 +183,7 @@ readpage:		no	yes, unlocks
 sync_page:		no	maybe
 writepages:		no
 set_page_dirty		no	no
+page_dirtied		no	no
 readpages:		no
 prepare_write:		no	yes
 commit_write:		no	yes
@@ -263,10 +265,10 @@ nr_to_write is NULL, all dirty pages mus
 writepages should _only_ write pages which are present on
 mapping->io_pages.
 
-	->set_page_dirty() is called from various places in the kernel
-when the target page is marked as needing writeback.  It may be called
-under spinlock (it cannot block) and is sometimes called with the page
-not locked.
+	->set_page_dirty() and ->page_dirtied() are called from various
+places in the kernel when the target page is marked as needing writeback.
+They may be called under spinlock (they cannot block) and sometimes called
+with the page not locked.
 
 	->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some
 filesystems and by the swapper. The latter will eventually go away. All
Index: linux/Documentation/filesystems/vfs.txt
===================================================================
--- linux.orig/Documentation/filesystems/vfs.txt	2007-03-06 15:33:45.000000000 +0100
+++ linux/Documentation/filesystems/vfs.txt	2007-03-06 15:49:01.000000000 +0100
@@ -504,8 +504,8 @@ address_space has finer control of write
 
 The read process essentially only requires 'readpage'.  The write
 process is more complicated and uses prepare_write/commit_write or
-set_page_dirty to write data into the address_space, and writepage,
-sync_page, and writepages to writeback data to storage.
+set_page_dirty/page_dirtied to write data into the address_space,
+and writepage, sync_page, and writepages to writeback data to storage.
 
 Adding and removing pages to/from an address_space is protected by the
 inode's i_mutex.
@@ -529,7 +529,8 @@ struct address_space_operations {
 	int (*readpage)(struct file *, struct page *);
 	int (*sync_page)(struct page *);
 	int (*writepages)(struct address_space *, struct writeback_control *);
-	int (*set_page_dirty)(struct page *page);
+	void (*set_page_dirty)(struct page *page);
+	void (*page_dirtied)(struct page *page);
 	int (*readpages)(struct file *filp, struct address_space *mapping,
 			struct list_head *pages, unsigned nr_pages);
 	int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
@@ -587,13 +588,19 @@ struct address_space_operations {
   	instead.  This will choose pages from the address space that are
   	tagged as DIRTY and will pass them to ->writepage.
 
-  set_page_dirty: called by the VM to set a page dirty.
-        This is particularly needed if an address space attaches
-        private data to a page, and that data needs to be updated when
-        a page is dirtied.  This is called, for example, when a memory
-	mapped page gets modified.
-	If defined, it should set the PageDirty flag, and the
-        PAGECACHE_TAG_DIRTY tag in the radix tree.
+  set_page_dirty: called by the VM to set a page dirty.  This is particularly
+	needed if an address space attaches private data to a page, and that
+	data needs to be updated when a page is dirtied.  This is called, for
+	example, when a memory mapped page gets modified.  This method is
+	called before setting the PG_dirty flag for the page, and sometimes
+	even independently (see clear_page_dirty_for_io()).  The default for
+	this operation is block_set_page_dirty().
+
+  page_dirtied: called by the VM after the PG_dirty flag for a page has been
+	toggled from clear to dirty.  This can be used to set the
+	PAGECACHE_TAG_DIRTY tag in the radix tree, account the number of dirty
+	pages and to set the I_DIRTY_PAGES flag in the inode.  The default for
+	this operation is generic_page_dirtied().
 
   readpages: called by the VM to read pages associated with the address_space
   	object. This is essentially just a vector version of

--


* [patch 2/8] update ctime and mtime for mmaped write
From: Miklos Szeredi @ 2007-03-06 18:04 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel, staubach, hugh, a.p.zijlstra

[-- Attachment #1: mmap_mtime.patch --]
[-- Type: text/plain, Size: 14817 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Changes:
v2:
 o set AS_CMTIME flag in clear_page_dirty_for_io() too
 o don't clear AS_CMTIME in file_update_time()
 o check the dirty bit in the page tables
v1:
 o moved check from __fput() to remove_vma(), which is more logical
 o changed set_page_dirty() to set_page_dirty_mapping in hugetlb.c
 o cleaned up #ifdef CONFIG_BLOCK mess

This patch makes writing to shared memory mappings update st_ctime and
st_mtime as defined by SUSv3:

   The st_ctime and st_mtime fields of a file that is mapped with
   MAP_SHARED and PROT_WRITE shall be marked for update at some point
   in the interval between a write reference to the mapped region and
   the next call to msync() with MS_ASYNC or MS_SYNC for that portion
   of the file by any process. If there is no such call and if the
   underlying file is modified as a result of a write reference, then
   these fields shall be marked for update at some time after the
   write reference.

A new address_space flag is introduced: AS_CMTIME.  This is set each
time a page is dirtied through a userspace memory mapping.  This
includes write accesses via get_user_pages().

Note that the flag is set unconditionally, even if the page is already
dirty.  This is important, because the page might have been dirtied
earlier by a non-mmap write.

This flag is checked in msync() and munmap()/mremap(), and if set, the
file times are updated and the flag is cleared.

Msync also needs to check the dirty bit in the page tables, because
the data might change again after an msync(MS_ASYNC), while the page
is already dirty and read-write.  This also makes the time updating
work for memory backed filesystems such as tmpfs.

This implementation walks the pages in the synced range, and uses rmap
to find all the ptes for each page.  Non-linear vmas are ignored,
since the ptes can only be found by scanning the whole vma, which is
very inefficient.

As an optimization, if dirty pages are accounted, then only walk the
dirty pages, since the clean pages necessarily have clean ptes.  This
doesn't work for memory backed filesystems, where no dirty accounting
is done.
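
In code, this reduces to picking the page lookup primitive (condensed
from the mm/msync.c hunk below):

  	if (mapping_cap_account_dirty(mapping))
  		/* clean pages have clean ptes: dirty-tagged pages suffice */
  		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
  					      PAGECACHE_TAG_DIRTY, nr_pages);
  	else
  		/* e.g. tmpfs: every page in the range must be examined */
  		nr_pages = pagevec_lookup(&pvec, mapping, index, nr_pages);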

An alternative implementation could check for all intersecting vmas in
the mapping and walk the page tables for each.  This would probably be
more efficient for memory backed filesystems and if the number of
dirty pages is near the total number of pages in the range.

Fixes Novell Bugzilla #206431.

Inspired by Peter Staubach's patch and the resulting comments.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/include/linux/pagemap.h
===================================================================
--- linux.orig/include/linux/pagemap.h	2007-03-06 15:17:42.000000000 +0100
+++ linux/include/linux/pagemap.h	2007-03-06 15:17:46.000000000 +0100
@@ -18,6 +18,7 @@
  */
 #define	AS_EIO		(__GFP_BITS_SHIFT + 0)	/* IO error on async write */
 #define AS_ENOSPC	(__GFP_BITS_SHIFT + 1)	/* ENOSPC on async write */
+#define AS_CMTIME	(__GFP_BITS_SHIFT + 2)	/* ctime/mtime update needed */
 
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h	2007-03-06 15:17:46.000000000 +0100
+++ linux/include/linux/mm.h	2007-03-06 15:17:46.000000000 +0100
@@ -790,6 +790,7 @@ int redirty_page_for_writepage(struct wr
 				struct page *page);
 int FASTCALL(set_page_dirty(struct page *page));
 int set_page_dirty_lock(struct page *page);
+int set_page_dirty_mapping(struct page *page);
 int clear_page_dirty_for_io(struct page *page);
 
 extern unsigned long do_mremap(unsigned long addr,
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c	2007-03-06 15:17:42.000000000 +0100
+++ linux/mm/memory.c	2007-03-06 15:17:46.000000000 +0100
@@ -676,7 +676,7 @@ static unsigned long zap_pte_range(struc
 				anon_rss--;
 			else {
 				if (pte_dirty(ptent))
-					set_page_dirty(page);
+					set_page_dirty_mapping(page);
 				if (pte_young(ptent))
 					SetPageReferenced(page);
 				file_rss--;
@@ -954,7 +954,7 @@ struct page *follow_page(struct vm_area_
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
-			set_page_dirty(page);
+			set_page_dirty_mapping(page);
 		mark_page_accessed(page);
 	}
 unlock:
@@ -1514,6 +1514,15 @@ static inline void cow_user_page(struct 
 	copy_user_highpage(dst, src, va, vma);
 }
 
+static void set_page_dirty_mapping_balance(struct page *page)
+{
+	if (set_page_dirty_mapping(page)) {
+		struct address_space *mapping = page_mapping(page);
+		if (mapping)
+			balance_dirty_pages_ratelimited(mapping);
+	}
+}
+
 /*
  * This routine handles present pages, when users try to write
  * to a shared page. It is done by copying the page to a new address
@@ -1664,7 +1673,7 @@ gotten:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (dirty_page) {
-		set_page_dirty_balance(dirty_page);
+		set_page_dirty_mapping_balance(dirty_page);
 		put_page(dirty_page);
 	}
 	return ret;
@@ -2316,7 +2325,7 @@ retry:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (dirty_page) {
-		set_page_dirty_balance(dirty_page);
+		set_page_dirty_mapping_balance(dirty_page);
 		put_page(dirty_page);
 	}
 	return ret;
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c	2007-03-06 15:17:46.000000000 +0100
+++ linux/mm/page-writeback.c	2007-03-06 15:17:46.000000000 +0100
@@ -244,16 +244,6 @@ static void balance_dirty_pages(struct a
 		pdflush_operation(background_writeout, 0);
 }
 
-void set_page_dirty_balance(struct page *page)
-{
-	if (set_page_dirty(page)) {
-		struct address_space *mapping = page_mapping(page);
-
-		if (mapping)
-			balance_dirty_pages_ratelimited(mapping);
-	}
-}
-
 /**
  * balance_dirty_pages_ratelimited_nr - balance dirty memory state
  * @mapping: address_space which was dirtied
@@ -836,6 +826,30 @@ int fastcall set_page_dirty(struct page 
 EXPORT_SYMBOL(set_page_dirty);
 
 /*
+ * Special set_page_dirty() variant for dirtiness coming from a memory
+ * mapping.  In this case the ctime/mtime update flag needs to be set.
+ */
+int set_page_dirty_mapping(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+
+	if (likely(mapping)) {
+		set_bit(AS_CMTIME, &mapping->flags);
+		set_dirty_every_time(mapping, page);
+		if (!TestSetPageDirty(page)) {
+			page_dirtied(mapping, page);
+			return 1;
+		}
+		return 0;
+	}
+	if (!PageDirty(page)) {
+		if (!TestSetPageDirty(page))
+			return 1;
+	}
+	return 0;
+}
+
+/*
  * set_page_dirty() is racy if the caller has no reference against
  * page->mapping->host, and if the page is unlocked.  This is because another
  * CPU could truncate the page off the mapping and then free the mapping.
@@ -876,8 +890,10 @@ int clear_page_dirty_for_io(struct page 
 		struct address_space *mapping = page_mapping(page);
 
 		if (mapping && mapping_cap_account_dirty(mapping)) {
-			if (page_mkclean(page))
+			if (page_mkclean(page)) {
+				set_bit(AS_CMTIME, &mapping->flags);
 				set_dirty_every_time(mapping, page);
+			}
 
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c	2007-03-06 15:17:46.000000000 +0100
+++ linux/mm/rmap.c	2007-03-06 15:17:46.000000000 +0100
@@ -498,6 +498,43 @@ int page_mkclean(struct page *page)
 }
 
 /**
+ * is_page_modified - check and clear the dirty bit for all mappings of a page
+ * @page:	the page to check
+ */
+bool is_page_modified(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct vm_area_struct *vma;
+	struct prio_tree_iter iter;
+	bool modified = false;
+
+	BUG_ON(!mapping);
+	BUG_ON(!page_mapped(page));
+
+	spin_lock(&mapping->i_mmap_lock);
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
+		if (vma->vm_flags & VM_SHARED) {
+			struct mm_struct *mm = vma->vm_mm;
+			unsigned long addr = vma_address(page, vma);
+			pte_t *pte;
+			spinlock_t *ptl;
+
+			if (addr != -EFAULT &&
+			    (pte = page_check_address(page, mm, addr, &ptl))) {
+				if (ptep_clear_flush_dirty(vma, addr, pte))
+					modified = true;
+				pte_unmap_unlock(pte, ptl);
+			}
+		}
+	}
+	spin_unlock(&mapping->i_mmap_lock);
+	if (page_test_and_clear_dirty(page))
+		modified = true;
+	return modified;
+}
+
+/**
  * page_set_anon_rmap - setup new anonymous rmap
  * @page:	the page to add the mapping to
  * @vma:	the vm area in which the mapping is added
@@ -598,7 +635,7 @@ void page_remove_rmap(struct page *page,
 		 * faster for those pages still in swapcache.
 		 */
 		if (page_test_and_clear_dirty(page))
-			set_page_dirty(page);
+			set_page_dirty_mapping(page);
 		__dec_zone_page_state(page,
 				PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
 	}
@@ -643,7 +680,7 @@ static int try_to_unmap_one(struct page 
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
 	if (pte_dirty(pteval))
-		set_page_dirty(page);
+		set_page_dirty_mapping(page);
 
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
@@ -777,7 +814,7 @@ static void try_to_unmap_cluster(unsigne
 
 		/* Move the dirty bit to the physical page now the pte is gone. */
 		if (pte_dirty(pteval))
-			set_page_dirty(page);
+			set_page_dirty_mapping(page);
 
 		page_remove_rmap(page, vma);
 		page_cache_release(page);
Index: linux/include/linux/writeback.h
===================================================================
--- linux.orig/include/linux/writeback.h	2007-03-06 15:17:42.000000000 +0100
+++ linux/include/linux/writeback.h	2007-03-06 15:17:46.000000000 +0100
@@ -117,7 +117,6 @@ int sync_page_range(struct inode *inode,
 			loff_t pos, loff_t count);
 int sync_page_range_nolock(struct inode *inode, struct address_space *mapping,
 			   loff_t pos, loff_t count);
-void set_page_dirty_balance(struct page *page);
 void writeback_set_ratelimit(void);
 
 /* pdflush.c */
Index: linux/mm/mmap.c
===================================================================
--- linux.orig/mm/mmap.c	2007-03-06 15:17:42.000000000 +0100
+++ linux/mm/mmap.c	2007-03-06 15:17:46.000000000 +0100
@@ -222,12 +222,16 @@ void unlink_file_vma(struct vm_area_stru
 static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 {
 	struct vm_area_struct *next = vma->vm_next;
+	struct file *file = vma->vm_file;
 
 	might_sleep();
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
-	if (vma->vm_file)
-		fput(vma->vm_file);
+	if (file) {
+		if (test_and_clear_bit(AS_CMTIME, &file->f_mapping->flags))
+			file_update_time(file);
+		fput(file);
+	}
 	mpol_free(vma_policy(vma));
 	kmem_cache_free(vm_area_cachep, vma);
 	return next;
Index: linux/mm/hugetlb.c
===================================================================
--- linux.orig/mm/hugetlb.c	2007-03-06 15:17:42.000000000 +0100
+++ linux/mm/hugetlb.c	2007-03-06 15:17:46.000000000 +0100
@@ -390,7 +390,7 @@ void __unmap_hugepage_range(struct vm_ar
 
 		page = pte_page(pte);
 		if (pte_dirty(pte))
-			set_page_dirty(page);
+			set_page_dirty_mapping(page);
 		list_add(&page->lru, &page_list);
 	}
 	spin_unlock(&mm->page_table_lock);
Index: linux/mm/msync.c
===================================================================
--- linux.orig/mm/msync.c	2007-03-06 15:17:42.000000000 +0100
+++ linux/mm/msync.c	2007-03-06 15:17:46.000000000 +0100
@@ -12,6 +12,86 @@
 #include <linux/mman.h>
 #include <linux/file.h>
 #include <linux/syscalls.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/pagevec.h>
+
+/*
+ * Update ctime/mtime on msync().
+ *
+ * POSIX requires that the times be updated between a modification
+ * of the file through a memory mapping and the next msync for a
+ * region containing the modification.  The wording implies that this
+ * must be done even if the modification was through a different
+ * address space.  Ugh.
+ *
+ * Non-linear vmas are too hard to handle and they are non-standard
+ * anyway, so they are ignored for now.
+ *
+ * The "file modified" info is collected from two places:
+ *
+ *  - AS_CMTIME flag of the mapping
+ *  - the dirty bit of the ptes
+ *
+ * For memory backed filesystems all the pages in the range need to be
+ * examined.  For non-memory backed filesystems it is enough to look
+ * at the pages with the dirty tag.
+ */
+static void msync_update_file_time(struct vm_area_struct *vma,
+				   unsigned long start, unsigned long end)
+{
+	struct file *file = vma->vm_file;
+	struct address_space *mapping;
+	struct pagevec pvec;
+	pgoff_t index;
+	pgoff_t end_index;
+	bool modified;
+
+	if (!file || !(vma->vm_flags & VM_SHARED) ||
+	    (vma->vm_flags & VM_NONLINEAR))
+		return;
+
+	mapping = file->f_mapping;
+
+	modified = test_and_clear_bit(AS_CMTIME, &mapping->flags);
+
+	pagevec_init(&pvec, 0);
+	index = linear_page_index(vma, start);
+	end_index = linear_page_index(vma, end);
+	while (index < end_index) {
+		int i;
+		int nr_pages = min(end_index - index, (pgoff_t) PAGEVEC_SIZE);
+
+		if (mapping_cap_account_dirty(mapping))
+			nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
+					PAGECACHE_TAG_DIRTY, nr_pages);
+		else
+			nr_pages = pagevec_lookup(&pvec, mapping, index,
+						  nr_pages);
+		if (!nr_pages)
+			break;
+
+		for (i = 0; i < nr_pages; i++) {
+			struct page *page = pvec.pages[i];
+
+			/* Skip pages which are just being read */
+			if (!PageUptodate(page))
+				continue;
+
+			lock_page(page);
+			index = page->index + 1;
+			if (page->mapping == mapping &&
+			    is_page_modified(page)) {
+				set_page_dirty(page);
+				modified = true;
+			}
+			unlock_page(page);
+		}
+		pagevec_release(&pvec);
+	}
+
+	if (modified)
+		file_update_time(file);
+}
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
@@ -75,6 +155,9 @@ asmlinkage long sys_msync(unsigned long 
 			error = -EBUSY;
 			goto out_unlock;
 		}
+		if (flags & (MS_SYNC | MS_ASYNC))
+			msync_update_file_time(vma, start,
+					       min(end, vma->vm_end));
 		file = vma->vm_file;
 		start = vma->vm_end;
 		if ((flags & MS_SYNC) && file &&
Index: linux/include/linux/rmap.h
===================================================================
--- linux.orig/include/linux/rmap.h	2007-03-06 15:19:58.000000000 +0100
+++ linux/include/linux/rmap.h	2007-03-06 15:25:05.000000000 +0100
@@ -111,6 +111,8 @@ unsigned long page_address_in_vma(struct
  */
 int page_mkclean(struct page *);
 
+bool is_page_modified(struct page *page);
+
 #else	/* !CONFIG_MMU */
 
 #define anon_vma_init()		do {} while (0)

--


* [patch 3/8] per backing_dev dirty and writeback page accounting
From: Miklos Szeredi @ 2007-03-06 18:04 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: per-backing_dev-dirty-and-writeback-page-accounting.patch --]
[-- Type: text/plain, Size: 6309 bytes --]

From: Andrew Morton <akpm@osdl.org>

[tomoki.sekiyama.qu@hitachi.com: bugfix]

Miklos Szeredi <mszeredi@suse.cz>:

Changes:
 - updated to apply after clear_page_dirty_for_io() race fix

This is needed for

 - balance_dirty_pages() deadlock fix
 - fuse dirty page accounting

I have no idea how serious the scalability problems with this are.  If
they are serious, different solutions can probably be found for the
above, but this is certainly the simplest.
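
As a usage note: with this applied, both counters are visible per
request queue in sysfs, so the per-device dirty/writeback state can
be watched directly (the device name here is just an example):

  cat /sys/block/sda/queue/nr_dirty
  cat /sys/block/sda/queue/nr_writeback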

Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/block/ll_rw_blk.c
===================================================================
--- linux.orig/block/ll_rw_blk.c	2007-03-06 11:19:16.000000000 +0100
+++ linux/block/ll_rw_blk.c	2007-03-06 13:40:08.000000000 +0100
@@ -201,6 +201,8 @@ EXPORT_SYMBOL(blk_queue_softirq_done);
  **/
 void blk_queue_make_request(request_queue_t * q, make_request_fn * mfn)
 {
+	struct backing_dev_info *bdi = &q->backing_dev_info;
+
 	/*
 	 * set defaults
 	 */
@@ -208,9 +210,11 @@ void blk_queue_make_request(request_queu
 	blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
 	blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
 	q->make_request_fn = mfn;
-	q->backing_dev_info.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
-	q->backing_dev_info.state = 0;
-	q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY;
+	bdi->ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
+	bdi->state = 0;
+	bdi->capabilities = BDI_CAP_MAP_COPY;
+	atomic_long_set(&bdi->nr_dirty, 0);
+	atomic_long_set(&bdi->nr_writeback, 0);
 	blk_queue_max_sectors(q, SAFE_MAX_SECTORS);
 	blk_queue_hardsect_size(q, 512);
 	blk_queue_dma_alignment(q, 511);
@@ -3922,6 +3926,19 @@ static ssize_t queue_max_hw_sectors_show
 	return queue_var_show(max_hw_sectors_kb, (page));
 }
 
+static ssize_t queue_nr_dirty_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%lu\n",
+		atomic_long_read(&q->backing_dev_info.nr_dirty));
+
+}
+
+static ssize_t queue_nr_writeback_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%lu\n",
+		atomic_long_read(&q->backing_dev_info.nr_writeback));
+
+}
 
 static struct queue_sysfs_entry queue_requests_entry = {
 	.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
@@ -3946,6 +3963,16 @@ static struct queue_sysfs_entry queue_ma
 	.show = queue_max_hw_sectors_show,
 };
 
+static struct queue_sysfs_entry queue_nr_dirty_entry = {
+	.attr = {.name = "nr_dirty", .mode = S_IRUGO },
+	.show = queue_nr_dirty_show,
+};
+
+static struct queue_sysfs_entry queue_nr_writeback_entry = {
+	.attr = {.name = "nr_writeback", .mode = S_IRUGO },
+	.show = queue_nr_writeback_show,
+};
+
 static struct queue_sysfs_entry queue_iosched_entry = {
 	.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
 	.show = elv_iosched_show,
@@ -3957,6 +3984,8 @@ static struct attribute *default_attrs[]
 	&queue_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,
+	&queue_nr_dirty_entry.attr,
+	&queue_nr_writeback_entry.attr,
 	&queue_iosched_entry.attr,
 	NULL,
 };
Index: linux/include/linux/backing-dev.h
===================================================================
--- linux.orig/include/linux/backing-dev.h	2007-03-06 11:19:18.000000000 +0100
+++ linux/include/linux/backing-dev.h	2007-03-06 13:40:08.000000000 +0100
@@ -28,6 +28,8 @@ struct backing_dev_info {
 	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
 	unsigned long state;	/* Always use atomic bitops on this */
 	unsigned int capabilities; /* Device capabilities */
+	atomic_long_t nr_dirty;	/* Pages dirty against this BDI */
+	atomic_long_t nr_writeback;/* Pages under writeback against this BDI */
 	congested_fn *congested_fn; /* Function pointer if device is md/dm */
 	void *congested_data;	/* Pointer to aux data for congested func */
 	void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c	2007-03-06 13:28:26.000000000 +0100
+++ linux/mm/page-writeback.c	2007-03-06 13:45:55.000000000 +0100
@@ -743,6 +743,7 @@ void generic_page_dirtied(struct page *p
 	if (mapping) { /* Race with truncate? */
 		if (mapping_cap_account_dirty(mapping)) {
 			__inc_zone_page_state(page, NR_FILE_DIRTY);
+			atomic_long_inc(&mapping->backing_dev_info->nr_dirty);
 			task_io_account_write(PAGE_CACHE_SIZE);
 		}
 		radix_tree_tag_set(&mapping->page_tree,
@@ -896,6 +897,7 @@ int clear_page_dirty_for_io(struct page 
 			}
 
 			dec_zone_page_state(page, NR_FILE_DIRTY);
+			atomic_long_dec(&mapping->backing_dev_info->nr_dirty);
 		}
 		return 1;
 	}
@@ -913,10 +915,13 @@ int test_clear_page_writeback(struct pag
 
 		write_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestClearPageWriteback(page);
-		if (ret)
+		if (ret) {
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
+			atomic_long_dec(&mapping->backing_dev_info->
+					nr_writeback);
+		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestClearPageWriteback(page);
@@ -934,10 +939,13 @@ int test_set_page_writeback(struct page 
 
 		write_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestSetPageWriteback(page);
-		if (!ret)
+		if (!ret) {
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
+			atomic_long_inc(&mapping->backing_dev_info->
+					nr_writeback);
+		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
Index: linux/mm/truncate.c
===================================================================
--- linux.orig/mm/truncate.c	2007-03-06 11:19:18.000000000 +0100
+++ linux/mm/truncate.c	2007-03-06 13:40:08.000000000 +0100
@@ -70,6 +70,7 @@ void cancel_dirty_page(struct page *page
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
+			atomic_long_dec(&mapping->backing_dev_info->nr_dirty);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			if (account_size)
 				task_io_account_cancelled_write(account_size);

--

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 4/8] fix deadlock in balance_dirty_pages
  2007-03-06 18:04 [patch 0/8] VFS/VM patches Miklos Szeredi
                   ` (2 preceding siblings ...)
  2007-03-06 18:04 ` [patch 3/8] per backing_dev dirty and writeback page accounting Miklos Szeredi
@ 2007-03-06 18:04 ` Miklos Szeredi
  2007-03-06 18:04 ` [patch 5/8] fix deadlock in throttle_vm_writeout Miklos Szeredi
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Miklos Szeredi @ 2007-03-06 18:04 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: dirty_balancing_fix.patch --]
[-- Type: text/plain, Size: 4175 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

This deadlock happens when dirty pages from one filesystem are
written back through another filesystem.  It is easiest to demonstrate
with fuse, although it could affect loopback mounts as well (see the
following patches).

Let's call the filesystems A(bove) and B(elow).  Process Pr_a is
writing to A, and process Pr_b is writing to B.

Pr_a is bash-shared-mapping.  Pr_b is the fuse filesystem daemon
(fusexmp_fh); for simplicity let's assume that Pr_b is single
threaded.

These are the simplified stack traces of these processes after the
deadlock:

Pr_a (bash-shared-mapping):

  (block on queue)
  fuse_writepage
  generic_writepages
  writeback_inodes
  balance_dirty_pages
  balance_dirty_pages_ratelimited_nr
  set_page_dirty_mapping_balance
  do_no_page


Pr_b (fusexmp_fh):

  io_schedule_timeout
  congestion_wait
  balance_dirty_pages
  balance_dirty_pages_ratelimited_nr
  generic_file_buffered_write
  generic_file_aio_write
  ext3_file_write
  do_sync_write
  vfs_write
  sys_pwrite64


Thanks to the aggressive nature of Pr_a, it can happen that

  nr_file_dirty > dirty_thresh + margin

This is due to both nr_dirty growing and dirty_thresh shrinking, which
in turn is due to nr_file_mapped rapidly growing.  The exact size of
the margin at which the deadlock happens is not known, but it's around
100 pages.

At this point Pr_a enters balance_dirty_pages and starts to write back
some of its dirty pages.  After submitting some requests, it blocks
on the request queue.

The first write request will trigger Pr_b to perform a write()
syscall.  This will submit a write request to the block device and
then may enter balance_dirty_pages().

The condition for exiting balance_dirty_pages() is

 - either that write_chunk pages have been written

 - or nr_file_dirty + nr_writeback < dirty_thresh

It is entirely possible that fewer than write_chunk pages were written,
in which case balance_dirty_pages() will not exit even after all the
submitted requests have been successfully completed.

Which means that the write() syscall does not return.

Which means that no more dirty pages from A will be written back, and
neither nr_writeback nor nr_file_dirty will decrease.

Which means that balance_dirty_pages() will loop forever.
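
For reference, the loop in question has roughly this shape (a
simplified sketch, not the exact kernel source; "nr_reclaimable"
stands for nr_dirty + nr_unstable):

  for (;;) {
          struct writeback_control wbc = { .nr_to_write = write_chunk };

          writeback_inodes(&wbc);
          pages_written += write_chunk - wbc.nr_to_write;
          if (pages_written >= write_chunk)
                  break;          /* wrote our quota, exit */
          if (nr_reclaimable + nr_writeback < dirty_thresh)
                  break;          /* back under the limit, exit */
          congestion_wait(WRITE, HZ/10);
  }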

What if Pr_b is multithreaded?  The first thread will enter
balance_dirty_pages() and loop there as shown above.  It will hold
i_mutex for the inode, taken in generic_file_aio_write().

The other threads now try to write back more data into the same file,
but will block on i_mutex.  So even with an unlimited number of
threads, no progress is made.

Q.E.D.

The solution is to exit balance_dirty_pages() on the condition that
there are only a few dirty + writeback pages for this backing dev.  This
makes sure that there is always some progress with this setup.

The number of outstanding dirty + writeback pages is limited to 8, which
means that when over the threshold (dirty_exceeded == 1), each
filesystem may only effectively pin a maximum of 16 (+8 because of
ratelimiting) extra pages.

Note: a similar safety valve is always needed if there's a global limit
for the dirty+writeback pages, even if in the future there will be
some per-queue (or other) soft limit.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c	2007-02-27 14:41:07.000000000 +0100
+++ linux/mm/page-writeback.c	2007-02-27 14:41:07.000000000 +0100
@@ -201,6 +201,17 @@ static void balance_dirty_pages(struct a
 		if (!dirty_exceeded)
 			dirty_exceeded = 1;
 
+		/*
+		 * Acquit producer of dirty pages if there's little or
+		 * nothing to write back to this particular queue.
+		 *
+		 * Without this check a deadlock is possible if
+		 * one filesystem is writing data through another.
+		 */
+		if (atomic_long_read(&bdi->nr_dirty) +
+		    atomic_long_read(&bdi->nr_writeback) < 8)
+			break;
+
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
 		 * filesystems (i.e. NFS) in which data may have been

--

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 5/8] fix deadlock in throttle_vm_writeout
  2007-03-06 18:04 [patch 0/8] VFS/VM patches Miklos Szeredi
                   ` (3 preceding siblings ...)
  2007-03-06 18:04 ` [patch 4/8] fix deadlock in balance_dirty_pages Miklos Szeredi
@ 2007-03-06 18:04 ` Miklos Szeredi
  2007-03-06 18:04 ` [patch 6/8] balance dirty pages from loop device Miklos Szeredi
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Miklos Szeredi @ 2007-03-06 18:04 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: throttle_vm_writeout_fix.patch --]
[-- Type: text/plain, Size: 4480 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

This deadlock is similar to the one in balance_dirty_pages, but
instead of waiting in balance_dirty_pages after submitting a write
request, it happens during a memory allocation for filesystem B before
submitting a write request.

It is easy to reproduce on a machine with not too much memory.
E.g. try this on 2.6.21-rc1 UML with 32MB (works on physical hw as
well):

  dd if=/dev/zero of=/tmp/tmp.img bs=1048576 count=40
  mke2fs -j -F /tmp/tmp.img
  mkdir /tmp/img
  mount -oloop /tmp/tmp.img /tmp/img
  bash-shared-mapping /tmp/img/foo 30000000

The deadlock doesn't happen immediately, sometimes only after a few
minutes.

Simplified stack trace for bash-shared-mapping after the deadlock:

  io_schedule_timeout
  congestion_wait
  balance_dirty_pages
  balance_dirty_pages_ratelimited_nr
  generic_file_buffered_write
  __generic_file_aio_write_nolock
  generic_file_aio_write
  ext3_file_write
  do_sync_write
  vfs_write
  sys_pwrite64

and for [loop0]:

  io_schedule_timeout
  congestion_wait
  throttle_vm_writeout
  shrink_zone
  shrink_zones
  try_to_free_pages
  __alloc_pages
  find_or_create_page
  do_lo_send_aops
  lo_send
  do_bio_filebacked
  loop_thread

The requirement for the deadlock is that

  nr_writeback > dirty_thresh * 1.1 + margin

Again, the margin seems to be in the 100-page range.

The task of throttle_vm_writeout is to limit the rate at which pages
under writeback are created due to swapping.  There's no other way
direct reclaim can increase nr_writeback + nr_file_dirty.

So when there are few or no swap pages under writeback, it is safe
for this function to return.  This ensures that there's progress with
writing back dirty pages.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/include/linux/swap.h
===================================================================
--- linux.orig/include/linux/swap.h	2007-02-27 14:40:55.000000000 +0100
+++ linux/include/linux/swap.h	2007-02-27 14:41:08.000000000 +0100
@@ -279,10 +279,14 @@ static inline void disable_swap_token(vo
 	put_swap_token(swap_token_mm);
 }
 
+#define nr_swap_writeback \
+	atomic_long_read(&swapper_space.backing_dev_info->nr_writeback)
+
 #else /* CONFIG_SWAP */
 
 #define total_swap_pages			0
 #define total_swapcache_pages			0UL
+#define nr_swap_writeback			0UL
 
 #define si_swapinfo(val) \
 	do { (val)->freeswap = (val)->totalswap = 0; } while (0)
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c	2007-02-27 14:41:07.000000000 +0100
+++ linux/mm/page-writeback.c	2007-02-27 14:41:08.000000000 +0100
@@ -33,6 +33,7 @@
 #include <linux/syscalls.h>
 #include <linux/buffer_head.h>
 #include <linux/pagevec.h>
+#include <linux/swap.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -303,6 +304,21 @@ void throttle_vm_writeout(void)
 	long dirty_thresh;
 
         for ( ; ; ) {
+		/*
+		 * If there's no swapping going on, don't throttle.
+		 *
+		 * Starting writeback against mapped pages shouldn't
+		 * be a problem, as that doesn't increase the
+		 * sum of dirty + writeback.
+		 *
+		 * Without this, a deadlock is possible (also see
+		 * comment in balance_dirty_pages).  This has been
+		 * observed with running bash-shared-mapping on a
+		 * loopback mount.
+		 */
+		if (nr_swap_writeback < 16)
+			break;
+
 		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
 
                 /*
@@ -314,6 +330,7 @@ void throttle_vm_writeout(void)
                 if (global_page_state(NR_UNSTABLE_NFS) +
 			global_page_state(NR_WRITEBACK) <= dirty_thresh)
                         	break;
+
                 congestion_wait(WRITE, HZ/10);
         }
 }
Index: linux/mm/page_io.c
===================================================================
--- linux.orig/mm/page_io.c	2007-02-27 14:40:55.000000000 +0100
+++ linux/mm/page_io.c	2007-02-27 14:41:08.000000000 +0100
@@ -70,6 +70,7 @@ static int end_swap_bio_write(struct bio
 		ClearPageReclaim(page);
 	}
 	end_page_writeback(page);
+	atomic_long_dec(&swapper_space.backing_dev_info->nr_writeback);
 	bio_put(bio);
 	return 0;
 }
@@ -121,6 +122,7 @@ int swap_writepage(struct page *page, st
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		rw |= (1 << BIO_RW_SYNC);
 	count_vm_event(PSWPOUT);
+	atomic_long_inc(&swapper_space.backing_dev_info->nr_writeback);
 	set_page_writeback(page);
 	unlock_page(page);
 	submit_bio(rw, bio);

--

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 6/8] balance dirty pages from loop device
  2007-03-06 18:04 [patch 0/8] VFS/VM patches Miklos Szeredi
                   ` (4 preceding siblings ...)
  2007-03-06 18:04 ` [patch 5/8] fix deadlock in throttle_vm_writeout Miklos Szeredi
@ 2007-03-06 18:04 ` Miklos Szeredi
  2007-03-06 18:04 ` [patch 7/8] add filesystem subtype support Miklos Szeredi
  2007-03-06 18:04 ` [patch 8/8] consolidate generic_writepages and mpage_writepages fix Miklos Szeredi
  7 siblings, 0 replies; 28+ messages in thread
From: Miklos Szeredi @ 2007-03-06 18:04 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: loopback_dirty_fix.patch --]
[-- Type: text/plain, Size: 981 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

The function do_lo_send_aops() should call
balance_dirty_pages_ratelimited() after each page similarly to
generic_file_buffered_write().

Without this, writing the loop device directly (not through a
filesystem) is very slow, and also slows the whole system down,
because nr_dirty is constantly over the limit.

Beware: this patch without the fix to balance_dirty_pages() makes a
loopback mounted filesystem prone to deadlock.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/drivers/block/loop.c
===================================================================
--- linux.orig/drivers/block/loop.c	2007-02-27 14:40:55.000000000 +0100
+++ linux/drivers/block/loop.c	2007-02-27 14:41:08.000000000 +0100
@@ -275,6 +275,8 @@ static int do_lo_send_aops(struct loop_d
 		pos += size;
 		unlock_page(page);
 		page_cache_release(page);
+		balance_dirty_pages_ratelimited(mapping);
+		cond_resched();
 	}
 	ret = 0;
 out:

--

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 7/8] add filesystem subtype support
  2007-03-06 18:04 [patch 0/8] VFS/VM patches Miklos Szeredi
                   ` (5 preceding siblings ...)
  2007-03-06 18:04 ` [patch 6/8] balance dirty pages from loop device Miklos Szeredi
@ 2007-03-06 18:04 ` Miklos Szeredi
  2007-03-06 18:04 ` [patch 8/8] consolidate generic_writepages and mpage_writepages fix Miklos Szeredi
  7 siblings, 0 replies; 28+ messages in thread
From: Miklos Szeredi @ 2007-03-06 18:04 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: filesystem_subtypes.patch --]
[-- Type: text/plain, Size: 6838 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

There's a slight problem with filesystem type representation in fuse
based filesystems.

From the kernel's view, there are just two filesystem types: fuse and
fuseblk.  From the user's view there are lots of different filesystem
types.  The user is not even much concerned whether the filesystem is
fuse based or not.  So there's a conflict of interest in how this
should be
represented in fstab, mtab and /proc/mounts.

The current scheme is to encode the real filesystem type in the mount
source.  So an sshfs mount looks like this:

  sshfs#user@server:/   /mnt/server    fuse   rw,nosuid,nodev,...

This url-ish syntax works OK for sshfs and similar filesystems.
However for block device based filesystems (ntfs-3g, zfs) it doesn't
work, since the kernel expects the mount source to be a real device
name.

A possibly better scheme would be to encode the real type in the type
field as "type.subtype".  So fuse mounts would look like this:

  /dev/hda1       /mnt/windows   fuseblk.ntfs-3g   rw,...
  user@server:/   /mnt/server    fuse.sshfs        rw,nosuid,nodev,...

This patch adds the necessary code to the kernel so that this can be
correctly displayed in /proc/mounts.
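
For illustration, a mount could then request the subtype through the
type field like this (sshfs is a hypothetical example and assumes a
mount helper that passes the type through unchanged):

  mount -t fuse.sshfs user@server:/ /mnt/server
  grep sshfs /proc/mounts
  user@server:/ /mnt/server fuse.sshfs rw,nosuid,nodev 0 0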

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/fs/filesystems.c
===================================================================
--- linux.orig/fs/filesystems.c	2007-03-06 10:51:19.000000000 +0100
+++ linux/fs/filesystems.c	2007-03-06 13:48:45.000000000 +0100
@@ -41,11 +41,12 @@ void put_filesystem(struct file_system_t
 	module_put(fs->owner);
 }
 
-static struct file_system_type **find_filesystem(const char *name)
+static struct file_system_type **find_filesystem(const char *name, unsigned len)
 {
 	struct file_system_type **p;
 	for (p=&file_systems; *p; p=&(*p)->next)
-		if (strcmp((*p)->name,name) == 0)
+		if (strlen((*p)->name) == len &&
+		    strncmp((*p)->name, name, len) == 0)
 			break;
 	return p;
 }
@@ -68,11 +69,12 @@ int register_filesystem(struct file_syst
 	int res = 0;
 	struct file_system_type ** p;
 
+	BUG_ON(strchr(fs->name, '.'));
 	if (fs->next)
 		return -EBUSY;
 	INIT_LIST_HEAD(&fs->fs_supers);
 	write_lock(&file_systems_lock);
-	p = find_filesystem(fs->name);
+	p = find_filesystem(fs->name, strlen(fs->name));
 	if (*p)
 		res = -EBUSY;
 	else
@@ -215,19 +217,26 @@ int get_filesystem_list(char * buf)
 struct file_system_type *get_fs_type(const char *name)
 {
 	struct file_system_type *fs;
+	const char *dot = strchr(name, '.');
+	unsigned len = dot ? dot - name : strlen(name);
 
 	read_lock(&file_systems_lock);
-	fs = *(find_filesystem(name));
+	fs = *(find_filesystem(name, len));
 	if (fs && !try_module_get(fs->owner))
 		fs = NULL;
 	read_unlock(&file_systems_lock);
-	if (!fs && (request_module("%s", name) == 0)) {
+	if (!fs && (request_module("%.*s", len, name) == 0)) {
 		read_lock(&file_systems_lock);
-		fs = *(find_filesystem(name));
+		fs = *(find_filesystem(name, len));
 		if (fs && !try_module_get(fs->owner))
 			fs = NULL;
 		read_unlock(&file_systems_lock);
 	}
+
+	if (dot && fs && !(fs->fs_flags & FS_HAS_SUBTYPE)) {
+		put_filesystem(fs);
+		fs = NULL;
+	}
 	return fs;
 }
 
Index: linux/fs/fuse/inode.c
===================================================================
--- linux.orig/fs/fuse/inode.c	2007-03-06 10:51:19.000000000 +0100
+++ linux/fs/fuse/inode.c	2007-03-06 13:48:45.000000000 +0100
@@ -634,6 +634,7 @@ static int fuse_get_sb(struct file_syste
 static struct file_system_type fuse_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "fuse",
+	.fs_flags	= FS_HAS_SUBTYPE,
 	.get_sb		= fuse_get_sb,
 	.kill_sb	= kill_anon_super,
 };
@@ -650,6 +651,6 @@ static int fuse_get_sb_blk(struct file_s
 static struct file_system_type fuseblk_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "fuseblk",
+	.fs_flags	= FS_HAS_SUBTYPE | FS_REQUIRES_DEV,
 	.get_sb		= fuse_get_sb_blk,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c	2007-03-06 11:19:17.000000000 +0100
+++ linux/fs/namespace.c	2007-03-06 13:48:45.000000000 +0100
@@ -377,6 +377,10 @@ static int show_vfsmnt(struct seq_file *
 	seq_path(m, mnt, mnt->mnt_root, " \t\n\\");
 	seq_putc(m, ' ');
 	mangle(m, mnt->mnt_sb->s_type->name);
+	if (mnt->mnt_sb->s_subtype && mnt->mnt_sb->s_subtype[0]) {
+		seq_putc(m, '.');
+		mangle(m, mnt->mnt_sb->s_subtype);
+	}
 	seq_puts(m, mnt->mnt_sb->s_flags & MS_RDONLY ? " ro" : " rw");
 	for (fs_infop = fs_info; fs_infop->flag; fs_infop++) {
 		if (mnt->mnt_sb->s_flags & fs_infop->flag)
Index: linux/fs/super.c
===================================================================
--- linux.orig/fs/super.c	2007-03-06 10:51:20.000000000 +0100
+++ linux/fs/super.c	2007-03-06 13:48:45.000000000 +0100
@@ -107,6 +107,7 @@ out:
 static inline void destroy_super(struct super_block *s)
 {
 	security_sb_free(s);
+	kfree(s->s_subtype);
 	kfree(s);
 }
 
@@ -919,6 +920,29 @@ out:
 
 EXPORT_SYMBOL_GPL(vfs_kern_mount);
 
+static struct vfsmount *fs_set_subtype(struct vfsmount *mnt, const char *fstype)
+{
+	int err;
+	const char *subtype = strchr(fstype, '.');
+	if (subtype) {
+		subtype++;
+		err = -EINVAL;
+		if (!subtype[0])
+			goto err;
+	} else
+		subtype = "";
+
+	mnt->mnt_sb->s_subtype = kstrdup(subtype, GFP_KERNEL);
+	err = -ENOMEM;
+	if (!mnt->mnt_sb->s_subtype)
+		goto err;
+	return mnt;
+
+ err:
+	mntput(mnt);
+	return ERR_PTR(err);
+}
+
 struct vfsmount *
 do_kern_mount(const char *fstype, int flags, const char *name, void *data)
 {
@@ -927,6 +951,9 @@ do_kern_mount(const char *fstype, int fl
 	if (!type)
 		return ERR_PTR(-ENODEV);
 	mnt = vfs_kern_mount(type, flags, name, data);
+	if (!IS_ERR(mnt) && (type->fs_flags & FS_HAS_SUBTYPE) &&
+	    !mnt->mnt_sb->s_subtype)
+		mnt = fs_set_subtype(mnt, fstype);
 	put_filesystem(type);
 	return mnt;
 }
Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h	2007-03-06 11:20:07.000000000 +0100
+++ linux/include/linux/fs.h	2007-03-06 13:48:45.000000000 +0100
@@ -91,6 +91,7 @@ extern int dir_notify_enable;
 /* public flags for file_system_type */
 #define FS_REQUIRES_DEV 1 
 #define FS_BINARY_MOUNTDATA 2
+#define FS_HAS_SUBTYPE 4
 #define FS_REVAL_DOT	16384	/* Check the paths ".", ".." for staleness */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move()
 					 * during rename() internally.
@@ -954,6 +955,12 @@ struct super_block {
 	/* Granularity of c/m/atime in ns.
 	   Cannot be worse than a second */
 	u32		   s_time_gran;
+
+	/*
+	 * Filesystem subtype.  If non-empty the filesystem type field
+	 * in /proc/mounts will be "type.subtype"
+	 */
+	char *s_subtype;
 };
 
 extern struct timespec current_fs_time(struct super_block *sb);

--

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 8/8] consolidate generic_writepages and mpage_writepages fix
  2007-03-06 18:04 [patch 0/8] VFS/VM patches Miklos Szeredi
                   ` (6 preceding siblings ...)
  2007-03-06 18:04 ` [patch 7/8] add filesystem subtype support Miklos Szeredi
@ 2007-03-06 18:04 ` Miklos Szeredi
  2007-03-07 20:46   ` Andrew Morton
  7 siblings, 1 reply; 28+ messages in thread
From: Miklos Szeredi @ 2007-03-06 18:04 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: writepages_fix.patch --]
[-- Type: text/plain, Size: 683 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

Fix NULL pointer dereference in __mpage_writepage.  This probably
doesn't matter in practice, but this is the original behavior.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/fs/mpage.c
===================================================================
--- linux.orig/fs/mpage.c	2007-03-06 17:30:09.000000000 +0100
+++ linux/fs/mpage.c	2007-03-06 17:30:33.000000000 +0100
@@ -662,7 +662,7 @@ confused:
 	if (bio)
 		bio = mpage_bio_submit(WRITE, bio);
 
-	if (mpd->use_writepage) {
+	if (mpd->use_writepage && mapping->a_ops->writepage) {
 		ret = mapping->a_ops->writepage(page, wbc);
 	} else {
 		ret = -EAGAIN;

--

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/8] update ctime and mtime for mmaped write
  2007-03-06 18:04 ` [patch 2/8] update ctime and mtime for mmaped write Miklos Szeredi
@ 2007-03-06 20:32   ` Peter Zijlstra
  2007-03-06 21:24     ` Miklos Szeredi
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2007-03-06 20:32 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-fsdevel, staubach, hugh

On Tue, 2007-03-06 at 19:04 +0100, Miklos Szeredi wrote:
> plain text document attachment (mmap_mtime.patch)

> Index: linux/mm/rmap.c
> ===================================================================
> --- linux.orig/mm/rmap.c	2007-03-06 15:17:46.000000000 +0100
> +++ linux/mm/rmap.c	2007-03-06 15:17:46.000000000 +0100
> @@ -498,6 +498,43 @@ int page_mkclean(struct page *page)
>  }
>  
>  /**
> + * is_page_modified - check and clear the dirty bit for all mappings of a page
> + * @page:	the page to check
> + */
> +bool is_page_modified(struct page *page)
> +{
> +	struct address_space *mapping = page->mapping;
> +	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> +	struct vm_area_struct *vma;
> +	struct prio_tree_iter iter;
> +	bool modified = false;
> +
> +	BUG_ON(!mapping);
> +	BUG_ON(!page_mapped(page));
> +
> +	spin_lock(&mapping->i_mmap_lock);
> +	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
> +		if (vma->vm_flags & VM_SHARED) {
> +			struct mm_struct *mm = vma->vm_mm;
> +			unsigned long addr = vma_address(page, vma);
> +			pte_t *pte;
> +			spinlock_t *ptl;
> +
> +			if (addr != -EFAULT &&
> +			    (pte = page_check_address(page, mm, addr, &ptl))) {
> +				if (ptep_clear_flush_dirty(vma, addr, pte))
> +					modified = true;
> +				pte_unmap_unlock(pte, ptl);
> +			}
> +		}
> +	}
> +	spin_unlock(&mapping->i_mmap_lock);
> +	if (page_test_and_clear_dirty(page))
> +		modified = true;
> +	return modified;
> +}
> +
> +/**
>   * page_set_anon_rmap - setup new anonymous rmap
>   * @page:	the page to add the mapping to
>   * @vma:	the vm area in which the mapping is added

I'm not liking this, it's not a constant operation as the name implies.
And its style is a bit out of line with the rest of rmap.

The thing it actually does is page_mkclean(); all it doesn't do is
set the pte read-only.

I can understand you wanting to avoid the overhead of the minor faults
resulting from using page_mkclean(), but I'm not sure it's worth it.

> Index: linux/mm/msync.c
> ===================================================================
> --- linux.orig/mm/msync.c	2007-03-06 15:17:42.000000000 +0100
> +++ linux/mm/msync.c	2007-03-06 15:17:46.000000000 +0100
> @@ -12,6 +12,86 @@
>  #include <linux/mman.h>
>  #include <linux/file.h>
>  #include <linux/syscalls.h>
> +#include <linux/pagemap.h>
> +#include <linux/rmap.h>
> +#include <linux/pagevec.h>
> +
> +/*
> + * Update ctime/mtime on msync().
> + *
> + * POSIX requires that the times are updated between a modification
> + * of the file through a memory mapping and the next msync for a
> + * region containing the modification.  The wording implies that this
> + * must be done even if the modification was through a different
> + * address space.  Ugh.
> + *
> + * Non-linear vmas are too hard to handle and they are non-standard
> + * anyway, so they are ignored for now.
> + *
> + * The "file modified" info is collected from two places:
> + *
> + *  - AS_CMTIME flag of the mapping
> + *  - the dirty bit of the ptes
> + *
> + * For memory backed filesystems all the pages in the range need to be
> + * examined.  For non-memory backed filesystems it is enough to look
> + * at the pages with the dirty tag.
> + */
> +static void msync_update_file_time(struct vm_area_struct *vma,
> +				   unsigned long start, unsigned long end)
> +{
> +	struct file *file = vma->vm_file;
> +	struct address_space *mapping;
> +	struct pagevec pvec;
> +	pgoff_t index;
> +	pgoff_t end_index;
> +	bool modified;
> +
> +	if (!file || !(vma->vm_flags & VM_SHARED) ||
> +	    (vma->vm_flags & VM_NONLINEAR))
> +		return;
> +	mapping = file->f_mapping;
> +	modified = test_and_clear_bit(AS_CMTIME, &mapping->flags);
> +
> +	pagevec_init(&pvec, 0);
> +	index = linear_page_index(vma, start);
> +	end_index = linear_page_index(vma, end);
> +	while (index < end_index) {
> +		int i;
> +		struct address_space *mapping = file->f_mapping;
> +		int nr_pages = min(end_index - index, (pgoff_t) PAGEVEC_SIZE);
> +
> +		if (mapping_cap_account_dirty(mapping))
> +			nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
> +					PAGECACHE_TAG_DIRTY, nr_pages);
> +		else
> +			nr_pages = pagevec_lookup(&pvec, mapping, index,
> +						  nr_pages);
> +		if (!nr_pages)
> +			break;
> +
> +		for (i = 0; i < nr_pages; i++) {
> +			struct page *page = pvec.pages[i];
> +
> +			index = page->index + 1;
> +			/* Skip pages which are just being read */
> +			if (!PageUptodate(page))
> +				continue;
> +
> +			lock_page(page);
> +			if (page->mapping == mapping &&
> +			    is_page_modified(page)) {
> +				set_page_dirty(page);
> +				modified = true;
> +			}
> +			unlock_page(page);
> +		}
> +		pagevec_release(&pvec);
> +	}
> +
> +	if (modified)
> +		file_update_time(file);
> +}
>  
>  /*
>   * MS_SYNC syncs the entire file - including mappings.

Something bothers me, although I'm not sure what yet..


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/8] update ctime and mtime for mmaped write
  2007-03-06 20:32   ` Peter Zijlstra
@ 2007-03-06 21:24     ` Miklos Szeredi
  2007-03-06 21:47       ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Miklos Szeredi @ 2007-03-06 21:24 UTC (permalink / raw)
  To: a.p.zijlstra; +Cc: akpm, linux-kernel, linux-fsdevel, staubach, hugh

> I'm not liking this, it's not a constant operation as the name implies.

OK, I'll think of something better.

> And its style is a bit out of line with the rest of rmap.
> 
> The thing it actually does is page_mkclean(); all it doesn't do is
> set the pte read-only.
> 
> I can understand you wanting to avoid the overhead of the minor faults
> resulting from using page_mkclean(), but I'm not sure it's worth it.

It would be nice if the cost of MS_ASYNC wouldn't be too high.  And I
do have the feeling that minor faults are far more expensive than
cleaning the dirty bit in the ptes.

Do you have any numbers?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/8] update ctime and mtime for mmaped write
  2007-03-06 21:24     ` Miklos Szeredi
@ 2007-03-06 21:47       ` Peter Zijlstra
  2007-03-06 22:00         ` Miklos Szeredi
  2007-03-06 22:07         ` Peter Zijlstra
  0 siblings, 2 replies; 28+ messages in thread
From: Peter Zijlstra @ 2007-03-06 21:47 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-fsdevel, staubach, hugh

On Tue, 2007-03-06 at 22:24 +0100, Miklos Szeredi wrote:
> > I'm not liking this, it's not a constant operation as the name implies.
> 
> OK, I'll think of something better.
> 
> > And its style is a bit out of line with the rest of rmap.
> > 
> > The thing it actually does is page_mkclean(); all it doesn't do is
> > set the pte read-only.
> > 
> > I can understand you wanting to avoid the overhead of the minor faults
> > resulting from using page_mkclean(), but I'm not sure it's worth it.
> 
> It would be nice if the cost of MS_ASYNC wouldn't be too high.  And I
> do have the feeling that minor faults are far more expensive than
> cleaning the dirty bit in the ptes.
> 
> Do you have any numbers?

None whatsoever, but I always think of msync as a rare function
(infrequent when compared to (minor) faults overall). But I don't have
numbers backing that up either.

Also, the radix tree scan you do isn't exactly cheap either. 

So what I was wondering is whether it's worth optimizing this at the cost
of another rmap walker. (one with very dubious semantics at that - it
clears the pte dirty bit but doesn't particularly care about that nor
does it respect the PG_dirty / PTE dirty relation)


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/8] update ctime and mtime for mmaped write
  2007-03-06 21:47       ` Peter Zijlstra
@ 2007-03-06 22:00         ` Miklos Szeredi
  2007-03-06 22:07         ` Peter Zijlstra
  1 sibling, 0 replies; 28+ messages in thread
From: Miklos Szeredi @ 2007-03-06 22:00 UTC (permalink / raw)
  To: a.p.zijlstra; +Cc: akpm, linux-kernel, linux-fsdevel, staubach, hugh

> > > I can understand you wanting to avoid the overhead of the minor faults
> > > resulting from using page_mkclean(), but I'm not sure it's worth it.
> > 
> > It would be nice if the cost of MS_ASYNC wouldn't be too high.  And I
> > do have the feeling that minor faults are far more expensive than
> > cleaning the dirty bit in the ptes.
> > 
> > Do you have any numbers?
> 
> None whatsoever, but I always think of msync as a rare function
> (infrequent when compared to (minor) faults overall). But I don't have
> numbers backing that up either.

It depends entirely on the usage pattern.  I can imagine this sort of
use:

  mmap
  write lots of data to memory
  msync(MS_ASYNC)
  overwrite previous data
  msync(MS_ASYNC)
  ...

In this case write protecting and faulting the pages will be slower
than just checking the page tables.
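
Spelled out as a minimal (hypothetical) test program, with error
handling omitted and the sizes picked arbitrarily:

  #include <fcntl.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          const size_t len = 4 * 1024 * 1024;
          int fd = open("/tmp/scratch", O_RDWR | O_CREAT, 0600);
          char *p;
          int i;

          ftruncate(fd, len);
          p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          for (i = 0; i < 100; i++) {
                  memset(p, i, len);       /* dirty every page */
                  msync(p, len, MS_ASYNC); /* cheap if only the ptes are scanned */
          }
          munmap(p, len);
          close(fd);
          return 0;
  }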

> Also, the radix tree scan you do isn't exactly cheap either. 
> 
> So what I was wondering is whether it's worth optimizing this at the cost
> of another rmap walker. (one with very dubious semantics at that - it
> clears the pte dirty bit but doesn't particularly care about that nor
> does it respect the PG_dirty / PTE dirty relation)

I don't think this is dubious.  The pte dirty bit in this case means
that the page was modified _since_the_last_call_ of this function.

The PG_dirty on the other hand means that the page will need to be
written back.  So they have completely different meanings.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/8] update ctime and mtime for mmaped write
  2007-03-06 21:47       ` Peter Zijlstra
  2007-03-06 22:00         ` Miklos Szeredi
@ 2007-03-06 22:07         ` Peter Zijlstra
  2007-03-06 22:18           ` Miklos Szeredi
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2007-03-06 22:07 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-fsdevel, staubach, hugh

On Tue, 2007-03-06 at 22:47 +0100, Peter Zijlstra wrote:
> On Tue, 2007-03-06 at 22:24 +0100, Miklos Szeredi wrote:
> > > I'm not liking this, it's not a constant operation as the name implies.
> > 
> > OK, I'll think of something better.
> > 
> > > And its style is a bit out of line with the rest of rmap.
> > > 
> > > The thing it actually does is page_mkclean(); all it doesn't do is
> > > set the pte read-only.
> > > 
> > > I can understand you wanting to avoid the overhead of the minor faults
> > > resulting from using page_mkclean(), but I'm not sure it's worth it.
> > 
> > It would be nice if the cost of MS_ASYNC wouldn't be too high.  And I
> > do have the feeling that minor faults are far more expensive than
> > cleaning the dirty bit in the ptes.
> > 
> > Do you have any numbers?
> 
> None whatsoever, but I always think of msync as a rare function
> (infrequent when compared to (minor) faults overall). But I don't have
> numbers backing that up either.
> 
> Also, the radix tree scan you do isn't exactly cheap either. 
> 
> So what I was wondering is whether it's worth optimizing this at the cost
> of another rmap walker. (one with very dubious semantics at that - it
> clears the pte dirty bit but doesn't particularly care about that nor
> does it respect the PG_dirty / PTE dirty relation)

What this functionality requires is that MS_ASYNC is a full barrier wrt.
dirtiness. That is, we want to call set_page_dirty_mapping() as soon as
we touch a page in a dirtying fashion after MS_{,A}SYNC gets called.

Hence we need the full page_mkclean() functionality; otherwise we don't
set AS_CMTIME again in time.






^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/8] update ctime and mtime for mmaped write
  2007-03-06 22:07         ` Peter Zijlstra
@ 2007-03-06 22:18           ` Miklos Szeredi
  2007-03-06 22:28             ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Miklos Szeredi @ 2007-03-06 22:18 UTC (permalink / raw)
  To: a.p.zijlstra; +Cc: akpm, linux-kernel, linux-fsdevel, staubach, hugh

> > None whatsoever, but I always think of msync as a rare function
> > (infrequent when compared to (minor) faults overall). But I don't have
> > numbers backing that up either.
> > 
> > Also, the radix tree scan you do isn't exactly cheap either. 
> > 
> > So what I was wondering is whether it's worth optimizing this at the cost
> > of another rmap walker. (one with very dubious semantics at that - it
> > clears the pte dirty bit but doesn't particularly care about that nor
> > does it respect the PG_dirty / PTE dirty relation)
> 
> What this functionality requires is that MS_ASYNC is a full barrier wrt.
> dirtiness. That is, we want to call set_page_dirty_mapping() as soon as
> we touch a page in a dirtying fashion after MS_{,A}SYNC gets called.
> 
> Hence we need the full page_mkclean() functionality; otherwise we don't
> set AS_CMTIME again in time.

AS_CMTIME is only for the case when the "file modified since the last
msync" info is lost from the ptes, e.g. because of page reclaim.

So it doesn't matter if AS_CMTIME is not set; is_page_modified() will
provide the necessary barrier.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 1/8] fix race in clear_page_dirty_for_io()
  2007-03-06 18:04 ` [patch 1/8] fix race in clear_page_dirty_for_io() Miklos Szeredi
@ 2007-03-06 22:25   ` Andrew Morton
  0 siblings, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2007-03-06 22:25 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-kernel, linux-fsdevel, torvalds

On Tue, 06 Mar 2007 19:04:44 +0100
Miklos Szeredi <miklos@szeredi.hu> wrote:

> Fix race in clear_page_dirty_for_io() as suggested by Linus.

please describe this race.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/8] update ctime and mtime for mmaped write
  2007-03-06 22:18           ` Miklos Szeredi
@ 2007-03-06 22:28             ` Peter Zijlstra
  2007-03-06 22:36               ` Miklos Szeredi
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2007-03-06 22:28 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-fsdevel, staubach, hugh

On Tue, 2007-03-06 at 23:18 +0100, Miklos Szeredi wrote:
> > > None whatsoever, but I always think of msync as a rare function
> > > (infrequent when compared to (minor) faults overall). But I don't have
> > > numbers backing that up either.
> > > 
> > > Also, the radix tree scan you do isn't exactly cheap either. 
> > > 
> > > So what I was wondering is whether it's worth optimizing this at the cost
> > > of another rmap walker. (one with very dubious semantics at that - it
> > > clears the pte dirty bit but doesn't particularly care about that nor
> > > does it respect the PG_dirty / PTE dirty relation)
> > 
> > What this functionality requires is that MS_ASYNC is a full barrier wrt.
> > dirtiness. That is, we want to call set_page_dirty_mapping() as soon as
> > we touch a page in a dirtying fashion after MS_{,A}SYNC gets called.
> > 
> > Hence we need the full page_mkclean() functionality; otherwise we don't
> > set AS_CMTIME again in time.
> 
> AS_CMTIME is only for the case when the "file modified since the last
> msync" info is lost from the ptes, e.g. because of page reclaim.
> 
> So it doesn't matter if AS_CMTIME is not set; is_page_modified() will
> provide the necessary barrier.

The trouble is, we went from a pull to a push model, and now you're
adding pull code again.

We have PG_dirty correct at all times; I think it's no less reasonable to
have AS_CMTIME correct in the same fashion.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/8] update ctime and mtime for mmaped write
  2007-03-06 22:28             ` Peter Zijlstra
@ 2007-03-06 22:36               ` Miklos Szeredi
  0 siblings, 0 replies; 28+ messages in thread
From: Miklos Szeredi @ 2007-03-06 22:36 UTC (permalink / raw)
  To: a.p.zijlstra; +Cc: akpm, linux-kernel, linux-fsdevel, staubach, hugh

> > AS_CMTIME is only for the case when the "file modified since the last
> > msync" info is lost from the ptes, e.g. because of page reclaim.
> > 
> > So it doesn't matter if AS_CMTIME is not set; is_page_modified() will
> > provide the necessary barrier.
> 
> The trouble is, we went from a pull to a push model, and now you're
> adding pull code again.
> 
> We have PG_dirty correct at all times; I think it's no less reasonable to
> have AS_CMTIME correct in the same fashion.

There's a very important difference.  PG_dirty correctness is needed
because the _number_ of dirty pages needs to be accounted.  But there
is no such requirement for AS_CMTIME; nobody cares _how_many_ pages
have been modified since the last msync() call.  All we need to know
is _if_ the file has been modified or not.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 8/8] consolidate generic_writepages and mpage_writepages fix
  2007-03-06 18:04 ` [patch 8/8] consolidate generic_writepages and mpage_writepages fix Miklos Szeredi
@ 2007-03-07 20:46   ` Andrew Morton
  2007-03-07 21:26     ` Miklos Szeredi
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2007-03-07 20:46 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-kernel, linux-fsdevel

On Tue, 06 Mar 2007 19:04:51 +0100
Miklos Szeredi <miklos@szeredi.hu> wrote:

> From: Miklos Szeredi <mszeredi@suse.cz>
> 
> Fix NULL pointer dereference in __mpage_writepage.  This probably
> doesn't matter in practice, but this is the original behavior.
> 
> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
> ---
> 
> Index: linux/fs/mpage.c
> ===================================================================
> --- linux.orig/fs/mpage.c	2007-03-06 17:30:09.000000000 +0100
> +++ linux/fs/mpage.c	2007-03-06 17:30:33.000000000 +0100
> @@ -662,7 +662,7 @@ confused:
>  	if (bio)
>  		bio = mpage_bio_submit(WRITE, bio);
>  
> -	if (mpd->use_writepage) {
> +	if (mpd->use_writepage && mapping->a_ops->writepage) {
>  		ret = mapping->a_ops->writepage(page, wbc);
>  	} else {
>  		ret = -EAGAIN;
> 
> --

I'm inclined to leave the code as-is, as an optimisation.  If it
blows up then it'd be cleaner to change the caller (mpage_writepages)
to do

	.use_writepage = (writepage != NULL),

?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 8/8] consolidate generic_writepages and mpage_writepages fix
  2007-03-07 20:46   ` Andrew Morton
@ 2007-03-07 21:26     ` Miklos Szeredi
  0 siblings, 0 replies; 28+ messages in thread
From: Miklos Szeredi @ 2007-03-07 21:26 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

> > From: Miklos Szeredi <mszeredi@suse.cz>
> > 
> > Fix NULL pointer dereference in __mpage_writepage.  This probably
> > doesn't matter in practice, but this is the original behavior.
> > 
> > Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
> > ---
> > 
> > Index: linux/fs/mpage.c
> > ===================================================================
> > --- linux.orig/fs/mpage.c	2007-03-06 17:30:09.000000000 +0100
> > +++ linux/fs/mpage.c	2007-03-06 17:30:33.000000000 +0100
> > @@ -662,7 +662,7 @@ confused:
> >  	if (bio)
> >  		bio = mpage_bio_submit(WRITE, bio);
> >  
> > -	if (mpd->use_writepage) {
> > +	if (mpd->use_writepage && mapping->a_ops->writepage) {
> >  		ret = mapping->a_ops->writepage(page, wbc);
> >  	} else {
> >  		ret = -EAGAIN;
> > 
> > --
> 
> I'm inclined to leave the code as-is, as an optimisation.  If it
> blows up then it'd be cleaner to change the caller (mpage_writepages)
> to do
> 
> 	.use_writepage = (writepage != NULL),
> 
> ?

It seems all in-tree uses of mpage_writepages define ->writepage(), so
it should be safe to drop this patch.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 3/8] per backing_dev dirty and writeback page accounting
  2007-03-06 18:04 ` [patch 3/8] per backing_dev dirty and writeback page accounting Miklos Szeredi
@ 2007-03-12  6:23   ` David Chinner
  2007-03-12 11:40     ` Miklos Szeredi
  0 siblings, 1 reply; 28+ messages in thread
From: David Chinner @ 2007-03-12  6:23 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-fsdevel

On Tue, Mar 06, 2007 at 07:04:46PM +0100, Miklos Szeredi wrote:
> From: Andrew Morton <akpm@osdl.org>
> 
> [tomoki.sekiyama.qu@hitachi.com: bugfix]
> 
> Miklos Szeredi <mszeredi@suse.cz>:
> 
> Changes:
>  - updated to apply after clear_page_dirty_for_io() race fix
> 
> This is needed for
> 
>  - balance_dirty_pages() deadlock fix
>  - fuse dirty page accounting
> 
> I have no idea how serious the scalability problems with this are.  If
> they are serious, different solutions can probably be found for the
> above, but this is certainly the simplest.

Atomic operations to a single per-backing-device counter from all CPUs
at once?
That's a pretty serious scalability issue and it will cause a major
performance regression for XFS.

I'd call this a showstopper right now - maybe you need to look at
something like the ZVC code that Christoph Lameter wrote, perhaps?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 3/8] per backing_dev dirty and writeback page accounting
  2007-03-12  6:23   ` David Chinner
@ 2007-03-12 11:40     ` Miklos Szeredi
  2007-03-12 21:44       ` David Chinner
  0 siblings, 1 reply; 28+ messages in thread
From: Miklos Szeredi @ 2007-03-12 11:40 UTC (permalink / raw)
  To: dgc; +Cc: akpm, linux-kernel, linux-fsdevel

> > I have no idea how serious the scalability problems with this are.  If
> > they are serious, different solutions can probably be found for the
> > above, but this is certainly the simplest.
> 
> Atomic operations to a single per-backing-device counter from all CPUs
> at once?
> That's a pretty serious scalability issue and it will cause a major
> performance regression for XFS.

OK.  How about just accounting writeback pages?  That should be much
less of a problem, since normally writeback is started from
pdflush/kupdate in large batches without any concurrency.

Or is it possible to export the state of the device queue to mm?
E.g. could balance_dirty_pages() query the backing dev if there are
any outstanding write requests?

> I'd call this a showstopper right now - maybe you need to look at
> something like the ZVC code that Christoph Lameter wrote, perhaps?

That's rather a heavyweight approach for this I think.

The only info balance_dirty_pages() really needs is whether there are
any dirty+writeback pages bound for the backing dev or not.

It knows about the dirty pages, since it calls writeback_inodes() which
scans the dirty pages for this backing dev looking for ones to write
out.  If after returning from writeback_inodes() wbc->nr_to_write
didn't decrease and wbc->pages_skipped is zero then we know that there
are no more dirty pages for the device.  Or at least there are no
dirty pages which aren't already under writeback.
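
In code, that check would look something like this (sketch only; the
resulting flag name is made up):

  struct writeback_control wbc = {
          .bdi            = bdi,
          .nr_to_write    = write_chunk,
          .pages_skipped  = 0,
  };

  writeback_inodes(&wbc);
  if (wbc.nr_to_write == write_chunk && wbc.pages_skipped == 0)
          bdi_quiet = 1;  /* nothing dirty left that isn't under writeback */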

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 3/8] per backing_dev dirty and writeback page accounting
  2007-03-12 11:40     ` Miklos Szeredi
@ 2007-03-12 21:44       ` David Chinner
  2007-03-12 22:36         ` Miklos Szeredi
  0 siblings, 1 reply; 28+ messages in thread
From: David Chinner @ 2007-03-12 21:44 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: dgc, akpm, linux-kernel, linux-fsdevel

On Mon, Mar 12, 2007 at 12:40:47PM +0100, Miklos Szeredi wrote:
> > > I have no idea how serious the scalability problems with this are.  If
> > > they are serious, different solutions can probably be found for the
> > > above, but this is certainly the simplest.
> > 
> > Atomic operations to a single per-backing-device counter from all CPUs
> > at once?
> > That's a pretty serious scalability issue and it will cause a major
> > performance regression for XFS.
> 
> OK.  How about just accounting writeback pages?  That should be much
> less of a problem, since normally writeback is started from
> pdflush/kupdate in large batches without any concurrency.

Except when you are throttling you bounce the cacheline around
each cpu as it triggers foreground writeback.....

> Or is it possible to export the state of the device queue to mm?
> E.g. could balance_dirty_pages() query the backing dev if there are
> any outstanding write requests?

Not directly - writeback_in_progress(bdi) is a coarse measure
indicating pdflush is active on this bdi, which implies outstanding
write requests.

> > I'd call this a showstopper right now - maybe you need to look at
> > something like the ZVC code that Christoph Lameter wrote, perhaps?
> 
> That's rather a heavyweight approach for this I think.

But if you want to use per-page accounting, you are going to
need a per-cpu or per-zone set of counters on each bdi to do
this without introducing regressions.

> The only info balance_dirty_pages() really needs is whether there are
> any dirty+writeback pages bound for the backing dev or not.

writeback bound (i.e. writing as fast as we can) is probably
indicated fairly reliably by bdi_congested(bdi).

Now all you need is the number of dirty pages....

> It knows about the dirty pages, since it calls writeback_inodes() which
> scans the dirty pages for this backing dev looking for ones to write
> out.

It scans the dirty inode list for dirty inodes, which indirectly finds
the dirty pages. It does not know about the number of dirty pages
directly...

> If after returning from writeback_inodes() wbc->nr_to_write
> didn't decrease and wbc->pages_skipped is zero then we know that there
> are no more dirty pages for the device.  Or at least there are no
> dirty pages which aren't already under writeback.

Sure, you can tell if there are _no_ dirty pages on the bdi, but
if there are dirty pages, you can't tell how many there are. Your
followup patches need to know how many dirty+writeback pages there
are on the bdi, so I don't really see any way you can solve the
deadlock in this manner without scalable bdi->nr_dirty accounting.

----

IIUC, your problem is that there's another bdi that holds all the
dirty pages, and this throttle loop never flushes pages from that
other bdi and we sleep instead. It seems to me that the fundamental
problem is that to clean the pages we need to flush both bdi's, not
just the bdi we are directly dirtying.

How about a "dependent bdi" link? i.e. if you have a loopback
filesystem, it has a direct bdi (the loopback device) and a
dependent bdi - the bdi that belongs to the underlying filesystem.

When we enter the throttle loop we flush from the direct bdi
and if we fail to flush all the pages we require, we flush
the dependent bdi (maybe even just kick pdflush for that bdi)
before we call congestion_wait() and go to sleep. This way
we are always making progress cleaning pages on the machine,
not just transferring dirty pages from one bdi to another.

Wouldn't that solve the deadlock without needing painful
accounting?
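
For concreteness, a sketch of what that might look like (everything
here is hypothetical - the field name and the flushing strategy are
invented, not existing kernel code):

  /* hypothetical extra member in struct backing_dev_info */
  struct backing_dev_info *bdi_dependent;

  /* in the throttle loop, after flushing the direct bdi: */
  if (pages_written < write_chunk && bdi->bdi_dependent) {
          /* kick background writeback for the underlying bdi too */
          pdflush_operation(background_writeout, 0);
  }
  congestion_wait(WRITE, HZ/10);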

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 3/8] per backing_dev dirty and writeback page accounting
  2007-03-12 21:44       ` David Chinner
@ 2007-03-12 22:36         ` Miklos Szeredi
  2007-03-12 23:12           ` David Chinner
  0 siblings, 1 reply; 28+ messages in thread
From: Miklos Szeredi @ 2007-03-12 22:36 UTC (permalink / raw)
  To: dgc; +Cc: akpm, linux-kernel, linux-fsdevel

I'll try to explain the reason for the deadlock first.

> IIUC, your problem is that there's another bdi that holds all the
> dirty pages, and this throttle loop never flushes pages from that
> other bdi and we sleep instead. It seems to me that the fundamental
> problem is that to clean the pages we need to flush both bdi's, not
> just the bdi we are directly dirtying.

This is what happens:

write fault on upper filesystem
  balance_dirty_pages
    submit write requests
  loop ...
------- fuse IPC ---------------
[fuse loopback fs thread 1]
read request
sys_write
  mutex_lock(i_mutex)
  ...
     balance_dirty_pages
        submit write requests
        loop ... write requests completed ... dirty still over limit ... 
	... loop forever

[fuse loopback fs thread 2]
read request
sys_write
  mutex_lock(i_mutex) blocks

So the queue for the upper filesystem is full.  The queue for the
lower filesystem is empty.  There are no dirty pages in the lower
filesystem.

So kicking pdflush for the lower filesystem doesn't help; there's
nothing to do.  balance_dirty_pages() for the lower filesystem should
just realize that there's nothing to do and return, and then there
would be progress.

So there's really no need to do any accounting, just some
logic to determine that a backing dev is nearly or completely
quiescent.

And getting out of this tight situation doesn't have to be efficient.
This is probably a very rare corner case that almost never happens in
real life, only with aggressive test tools like bash_shared_mapping.

> > OK.  How about just accounting writeback pages?  That should be much
> > less of a problem, since normally writeback is started from
> > pdflush/kupdate in large batches without any concurrency.
> 
> Except when you are throttling you bounce the cacheline around
> each cpu as it triggers foreground writeback.....

Yeah, we'd lose a bit of CPU, but not any write performance, since it
is being throttled back anyway.

> > Or is it possible to export the state of the device queue to mm?
> > E.g. could balance_dirty_pages() query the backing dev if there are
> > any outstanding write requests?
> 
> Not directly - writeback_in_progress(bdi) is a coarse measure
> indicating pdflush is active on this bdi, which implies outstanding
> write requests.

Hmm, not quite what I need.
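
(For reference, if I remember the source right, that check is just a
bit test on the bdi state:

	/* fs/fs-writeback.c, roughly this era */
	int writeback_in_progress(struct backing_dev_info *bdi)
	{
		return test_bit(BDI_pdflush, &bdi->state);
	}

so it says nothing about how much is actually sitting in the device
queue.)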

> > > I'd call this a showstopper right now - maybe you need to look at
> > > something like the ZVC code that Christoph Lameter wrote, perhaps?
> > 
> > That's rather a heavyweight approach for this I think.
> 
> But if you want to use per-page accounting, you are going to
> need a per-cpu or per-zone set of counters on each bdi to do
> this without introducing regressions.

Yes, this is an option, but I hope for a simpler solution.
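
For the record, the per-cpu variant would look roughly like this
(hypothetical names; the real ZVC machinery is more involved):

	/* hypothetical per-cpu dirty accounting on a bdi */
	struct bdi_cpu_stat {
		long nr_dirty;
	} ____cacheline_aligned_in_smp;		/* one cacheline per cpu */

	struct bdi_stats {
		struct bdi_cpu_stat cpu[NR_CPUS];
	};

	static inline void bdi_dirty_add(struct bdi_stats *s, long delta)
	{
		s->cpu[get_cpu()].nr_dirty += delta;	/* cpu-local, cheap */
		put_cpu();
	}

	static long bdi_dirty_read(struct bdi_stats *s)
	{
		long sum = 0;
		int cpu;

		for_each_possible_cpu(cpu)	/* reads are rare and slow */
			sum += s->cpu[cpu].nr_dirty;
		return sum;
	}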

Thanks,
Miklos


* Re: [patch 3/8] per backing_dev dirty and writeback page accounting
  2007-03-12 22:36         ` Miklos Szeredi
@ 2007-03-12 23:12           ` David Chinner
  2007-03-13  8:21             ` Miklos Szeredi
  0 siblings, 1 reply; 28+ messages in thread
From: David Chinner @ 2007-03-12 23:12 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: dgc, akpm, linux-kernel, linux-fsdevel

On Mon, Mar 12, 2007 at 11:36:16PM +0100, Miklos Szeredi wrote:
> I'll try to explain the reason for the deadlock first.

Ah, thanks for that.

> > IIUC, your problem is that there's another bdi that holds all the
> > dirty pages, and this throttle loop never flushes pages from that
> > other bdi and we sleep instead. It seems to me that the fundamental
> > problem is that to clean the pages we need to flush both bdi's, not
> > just the bdi we are directly dirtying.
> 
> This is what happens:
> 
> write fault on upper filesystem
>   balance_dirty_pages
>     submit write requests
>   loop ...

Isn't this loop transferring the dirty state from the upper
filesystem to the lower filesystem? What I don't see here is
how the pages on this filesystem are not getting cleaned if
the lower filesystem is being flushed properly.

I'm probably missing something big and obvious, but I'm not
familiar with the exact workings of FUSE so please excuse my
ignorance....

> ------- fuse IPC ---------------
> [fuse loopback fs thread 1]

This is the lower filesystem? Or a callback thread for
doing the write requests to the lower filesystem?

> read request
> sys_write
>   mutex_lock(i_mutex)
>   ...
>      balance_dirty_pages
>         submit write requests
>         loop ... write requests completed ... dirty still over limit ... 
> 	... loop forever

Hmmm - the situation in balance_dirty_pages() after an attempt
to writeback_inodes(&wbc) that has written nothing because there
is nothing to write would be:

	wbc->nr_to_write == write_chunk &&
	wbc->pages_skipped == 0 &&
	wbc->encountered_congestion == 0 &&
	!bdi_congested(wbc->bdi)

What happens if you make that an exit condition to the loop?
Or alternatively, adding another bit to the wbc structure to
say "there was nothing to do" and setting that if we find
list_empty(&sb->s_dirty) when trying to flush dirty inodes.

[ FWIW, this may also solve another problem of fast block devices
being throttled incorrectly when a slow block dev is consuming
all the dirty pages... ]
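
As a concrete (untested) sketch, the break in balance_dirty_pages()
might look like this, using bdi_write_congested() for the congestion
test:

	/* sketch of the proposed exit condition */
	if (wbc.nr_to_write == write_chunk &&	/* nothing written... */
	    wbc.pages_skipped == 0 &&		/* ...nothing skipped... */
	    !wbc.encountered_congestion &&
	    !bdi_write_congested(bdi))
		break;	/* bdi is clean and idle: stop throttling it */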

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: [patch 3/8] per backing_dev dirty and writeback page accounting
  2007-03-12 23:12           ` David Chinner
@ 2007-03-13  8:21             ` Miklos Szeredi
  2007-03-13 22:12               ` David Chinner
  0 siblings, 1 reply; 28+ messages in thread
From: Miklos Szeredi @ 2007-03-13  8:21 UTC (permalink / raw)
  To: dgc; +Cc: akpm, linux-kernel, linux-fsdevel

> > > IIUC, your problem is that there's another bdi that holds all the
> > > dirty pages, and this throttle loop never flushes pages from that
> > > other bdi and we sleep instead. It seems to me that the fundamental
> > > problem is that to clean the pages we need to flush both bdi's, not
> > > just the bdi we are directly dirtying.
> > 
> > This is what happens:
> > 
> > write fault on upper filesystem
> >   balance_dirty_pages
> >     submit write requests
> >   loop ...
> 
> Isn't this loop transferring the dirty state from the upper
> filesystem to the lower filesystem?

What this loop is doing is putting write requests in the request
queue, and in so doing transforming page state from dirty to
writeback.

> What I don't see here is how the pages on this filesystem are not
> getting cleaned if the lower filesystem is being flushed properly.

Because the lower filesystem writes back one request, but then gets
stuck in balance_dirty_pages before returning.  So the write request
is never completed.

The problem is that balance_dirty_pages is waiting for the condition
that the global number of dirty+writeback pages goes below the
threshold.  But this condition can only be satisfied if
balance_dirty_pages() returns.

> I'm probably missing something big and obvious, but I'm not
> familiar with the exact workings of FUSE so please excuse my
> ignorance....
> 
> > ------- fuse IPC ---------------
> > [fuse loopback fs thread 1]
> 
> This is the lower filesystem? Or a callback thread for
> doing the write requests to the lower filesystem?

This is the fuse daemon.  It's a normal process that reads requests
from /dev/fuse, serves these requests then writes the reply back onto
/dev/fuse.  It is usually multithreaded, so it can serve many requests
in parallel.

The loopback filesystem serves the requests by issuing the relevant
filesystem syscalls on the underlying fs.
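
Schematically, each daemon thread does something like this (simplified
pseudo-code, not the real libfuse loop):

	/* schematic fuse daemon thread */
	for (;;) {
		/* blocks until the kernel has a request for us */
		ssize_t len = read(fuse_fd, buf, sizeof(buf));
		if (len <= 0)
			break;
		/* serve it with ordinary syscalls on the underlying fs;
		   a WRITE request becomes write(), which is where this
		   thread can wedge in balance_dirty_pages() */
		len = handle_request(buf, len, reply);
		write(fuse_fd, reply, len);	/* complete the request */
	}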

> > read request
> > sys_write
> >   mutex_lock(i_mutex)
> >   ...
> >      balance_dirty_pages
> >         submit write requests
> >         loop ... write requests completed ... dirty still over limit ... 
> > 	... loop forever
> 
> Hmmm - the situation in balance_dirty_pages() after an attempt
> to writeback_inodes(&wbc) that has written nothing because there
> is nothing to write would be:
> 
> 	wbc->nr_to_write == write_chunk &&
> 	wbc->pages_skipped == 0 &&
> 	wbc->encountered_congestion == 0 &&
> 	!bdi_congested(wbc->bdi)
> 
> What happens if you make that an exit condition to the loop?

That's almost right.  The only problem is that even if there's no
congestion, the device queue can be holding a great number of as yet
unwritten pages.  So exiting on this condition would mean that
dirty+writeback could go way over the threshold.

How much of a problem would this be?  I don't know; I guess it depends on
many things: how many queues, how many requests per queue, how many
bytes per request.
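
(For a rough sense of scale, with made-up numbers: 4 queues, 128
requests per queue and 512KB per request would bound the overshoot at
4 * 128 * 512KB = 256MB.)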

> Or alternatively, adding another bit to the wbc structure to
> say "there was nothing to do" and setting that if we find
> list_empty(&sb->s_dirty) when trying to flush dirty inodes.
> 
> [ FWIW, this may also solve another problem of fast block devices
> being throttled incorrectly when a slow block dev is consuming
> all the dirty pages... ]

There may be a patch floating around, which I think basically does
this, but only as long as the dirty+writeback are over a soft limit,
but under the hard limit.

When over the hard limit, balance_dirty_pages still loops until
dirty+writeback go below the threshold.
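
In sketch form (all names hypothetical, since that patch isn't quoted
here) the logic would be something like:

	/* the loop is only entered above the soft limit */
	if (nr_dirty + nr_writeback < hard_limit && nothing_left_to_write)
		break;	/* between soft and hard: let the writer proceed */
	/* at or above the hard limit: keep looping regardless */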

Thanks,
Miklos


* Re: [patch 3/8] per backing_dev dirty and writeback page accounting
  2007-03-13  8:21             ` Miklos Szeredi
@ 2007-03-13 22:12               ` David Chinner
  2007-03-14 22:09                 ` Miklos Szeredi
  0 siblings, 1 reply; 28+ messages in thread
From: David Chinner @ 2007-03-13 22:12 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: dgc, akpm, linux-kernel, linux-fsdevel

On Tue, Mar 13, 2007 at 09:21:59AM +0100, Miklos Szeredi wrote:
> > > read request
> > > sys_write
> > >   mutex_lock(i_mutex)
> > >   ...
> > >      balance_dirty_pages
> > >         submit write requests
> > >         loop ... write requests completed ... dirty still over limit ... 
> > > 	... loop forever
> > 
> > Hmmm - the situation in balance_dirty_pages() after an attempt
> > to writeback_inodes(&wbc) that has written nothing because there
> > is nothing to write would be:
> > 
> > 	wbc->nr_to_write == write_chunk &&
> > 	wbc->pages_skipped == 0 &&
> > 	wbc->encountered_congestion == 0 &&
> > 	!bdi_congested(wbc->bdi)
> > 
> > What happens if you make that an exit condition to the loop?
> 
> That's almost right.  The only problem is that even if there's no
> congestion, the device queue can be holding a great amount of yet
> unwritten pages.  So exiting on this condition would mean, that
> dirty+writeback could go way over the threshold.

Only if the queue depth is not bounded. Queue depths are bounded and so
the distance we can go over the threshold is limited.  This is the
fundamental principle on which the throttling is based.....

Hence, if the queue is not full, then we will have either written
dirty pages to it (i.e. wbc->nr_to_write != write_chunk, so we will throttle
or continue normally if write_chunk was written) or we have no more
dirty pages left.

Having no dirty pages left on the bdi and it not being congested
means we effectively have a clean, idle bdi. We should not be trying
to throttle writeback here - we can't do anything to improve the
situation by continuing to try to do writeback on this bdi, so we
may as well give up and let the writer continue. Once we have dirty
pages on the bdi, we'll get throttled appropriately.

The point I'm making here is that if the bdi is not congested, any
pages dirtied on that bdi can be cleaned _quickly_ and so writing
more pages to it isn't a big deal even if we are over the global
dirty threshold.

Remember, the global dirty threshold is not really a hard limit -
it's a threshold at which we change behaviour. Throttling idle bdi's
does not contribute usefully to reducing the number of dirty pages
in the system; all it really does is deny service to devices that could
otherwise be doing useful work.

> How much this would be a problem?  I don't know, I guess it depends on
> many things: how many queues, how many requests per queue, how many
> bytes per request.

Right, and most people don't have enough devices in their system for
this to be a problem. Even those of us that do have enough devices
for this to potentially be a problem usually have enough RAM in
the machine so that it is not a problem....

> > Or alternatively, adding another bit to the wbc structure to
> > say "there was nothing to do" and setting that if we find
> > list_empty(&sb->s_dirty) when trying to flush dirty inodes.
> > 
> > [ FWIW, this may also solve another problem of fast block devices
> > being throttled incorrectly when a slow block dev is consuming
> > all the dirty pages... ]
> 
> There may be a patch floating around, which I think basically does
> this, but only as long as the dirty+writeback are over a soft limit,
> but under the hard limit.
> 
> When over the hard limit, balance_dirty_pages still loops until
> dirty+writeback go below the threshold.

The difference between the two methods is that if there is any hard
limit that results in balance_dirty_pages looping then you have a
potential deadlock.  Hence the soft+hard limits will reduce the
occurrence but not remove the deadlock. Breaking out of the loop
when there is nothing to do simply means we'll reenter again
with something to do very shortly (and *then* throttle) if the
process continues to write.....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: [patch 3/8] per backing_dev dirty and writeback page accounting
  2007-03-13 22:12               ` David Chinner
@ 2007-03-14 22:09                 ` Miklos Szeredi
  0 siblings, 0 replies; 28+ messages in thread
From: Miklos Szeredi @ 2007-03-14 22:09 UTC (permalink / raw)
  To: dgc; +Cc: akpm, linux-kernel, linux-fsdevel

> Only if the queue depth is not bounded. Queue depths are bounded and so
> the distance we can go over the threshold is limited.  This is the
> fundamental principle on which the throttling is based.....
> 
> Hence, if the queue is not full, then we will have either written
> dirty pages to it (i.e. wbc->nr_to_write != write_chunk, so we will throttle
> or continue normally if write_chunk was written) or we have no more
> dirty pages left.
> 
> Having no dirty pages left on the bdi and it not being congested
> means we effectively have a clean, idle bdi. We should not be trying
> to throttle writeback here - we can't do anything to improve the
> situation by continuing to try to do writeback on this bdi, so we
> may as well give up and let the writer continue. Once we have dirty
> pages on the bdi, we'll get throttled appropriately.

OK, you convinced me.

How about this patch?  I introduced a new wbc counter, that sums the
number of dirty pages encountered, including ones already under
writeback.

Dave, big thanks for your insights.

Miklos

Index: linux/include/linux/writeback.h
===================================================================
--- linux.orig/include/linux/writeback.h	2007-03-14 22:43:42.000000000 +0100
+++ linux/include/linux/writeback.h	2007-03-14 22:58:56.000000000 +0100
@@ -44,6 +44,7 @@ struct writeback_control {
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	long pages_skipped;		/* Pages which were not written */
+	long nr_dirty;			/* Number of dirty pages encountered */
 
 	/*
 	 * For a_ops->writepages(): is start or end are non-zero then this is
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c	2007-03-14 22:41:01.000000000 +0100
+++ linux/mm/page-writeback.c	2007-03-14 23:00:20.000000000 +0100
@@ -220,6 +220,17 @@ static void balance_dirty_pages(struct a
 			pages_written += write_chunk - wbc.nr_to_write;
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
+
+			/*
+			 * If just a few dirty pages were encountered, and
+			 * the queue is not congested, then allow this dirty
+			 * producer to continue.  This resolves the deadlock
+			 * that happens when one filesystem writes back data
+			 * through another.  It should also help when a slow
+			 * device is completely blocking other writes.
+			 */
+			if (wbc.nr_dirty < 8 && !bdi_write_congested(bdi))
+				break;
 		}
 		congestion_wait(WRITE, HZ/10);
 	}
@@ -612,6 +623,7 @@ retry:
 					      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
 		unsigned i;
 
+		wbc->nr_dirty += nr_pages;
 		scanned = 1;
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];


end of thread

Thread overview: 28+ messages
2007-03-06 18:04 [patch 0/8] VFS/VM patches Miklos Szeredi
2007-03-06 18:04 ` [patch 1/8] fix race in clear_page_dirty_for_io() Miklos Szeredi
2007-03-06 22:25   ` Andrew Morton
2007-03-06 18:04 ` [patch 2/8] update ctime and mtime for mmaped write Miklos Szeredi
2007-03-06 20:32   ` Peter Zijlstra
2007-03-06 21:24     ` Miklos Szeredi
2007-03-06 21:47       ` Peter Zijlstra
2007-03-06 22:00         ` Miklos Szeredi
2007-03-06 22:07         ` Peter Zijlstra
2007-03-06 22:18           ` Miklos Szeredi
2007-03-06 22:28             ` Peter Zijlstra
2007-03-06 22:36               ` Miklos Szeredi
2007-03-06 18:04 ` [patch 3/8] per backing_dev dirty and writeback page accounting Miklos Szeredi
2007-03-12  6:23   ` David Chinner
2007-03-12 11:40     ` Miklos Szeredi
2007-03-12 21:44       ` David Chinner
2007-03-12 22:36         ` Miklos Szeredi
2007-03-12 23:12           ` David Chinner
2007-03-13  8:21             ` Miklos Szeredi
2007-03-13 22:12               ` David Chinner
2007-03-14 22:09                 ` Miklos Szeredi
2007-03-06 18:04 ` [patch 4/8] fix deadlock in balance_dirty_pages Miklos Szeredi
2007-03-06 18:04 ` [patch 5/8] fix deadlock in throttle_vm_writeout Miklos Szeredi
2007-03-06 18:04 ` [patch 6/8] balance dirty pages from loop device Miklos Szeredi
2007-03-06 18:04 ` [patch 7/8] add filesystem subtype support Miklos Szeredi
2007-03-06 18:04 ` [patch 8/8] consolidate generic_writepages and mpage_writepages fix Miklos Szeredi
2007-03-07 20:46   ` Andrew Morton
2007-03-07 21:26     ` Miklos Szeredi
