linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/8] Fix filesystem freezing
@ 2012-01-20 20:34 Jan Kara
  2012-01-20 20:34 ` [PATCH 1/8] fs: Improve filesystem freezing handling Jan Kara
                   ` (7 more replies)
  0 siblings, 8 replies; 26+ messages in thread
From: Jan Kara @ 2012-01-20 20:34 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Eric Sandeen, Dave Chinner, Surbhi Palande, Kamal Mostafa,
	Christoph Hellwig, LKML, xfs, linux-ext4


  Hello,

  This is the second iteration of my patches to improve filesystem freezing.
Filesystem freezing is currently racy and thus we can end up with dirty data
on frozen filesystem (see changelog of the first patch for detailed race
description and proposed fix). This patch series aims at fixing this.

The biggest changes since v1:
  * have two counters to provide safe state transitions for SB_FREEZE_WRITE
    and SB_FREEZE_TRANS states
  * use percpu counters instead of own percpu structure
  * added documentation fixes from the old fs freezing series
  * converted XFS to use SB_FREEZE_TRANS counter instead of its private
    m_active_trans counter

Now I'm sending this mostly as a heads up so that people see current state
because these patches are still racy. The problem is that when superblock is in
SB_FREEZE_WRITE state inodes can still be dirtied (e.g. through touch_atime,
directory operations or similar). These inodes then are not flushed and when
superblock passes to SB_FREEZE_TRANS state, flusher thread deadlocks on trying
to flush them.

To fix this, we will have to block also operations dirtying inodes already
in SB_FREEZE_WRITE state. I'm now looking into how to do this in the easiest
way.

								Honza

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/8] fs: Improve filesystem freezing handling
  2012-01-20 20:34 [PATCH 0/8] Fix filesystem freezing Jan Kara
@ 2012-01-20 20:34 ` Jan Kara
  2012-02-04  3:03   ` Eric Sandeen
  2012-01-20 20:34 ` [PATCH 2/8] vfs: Protect write paths by sb_start_write - sb_end_write Jan Kara
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2012-01-20 20:34 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Eric Sandeen, Dave Chinner, Surbhi Palande, Kamal Mostafa,
	Christoph Hellwig, LKML, xfs, linux-ext4, Jan Kara

vfs_check_frozen() tests are racy since the filesystem can be frozen just after
the test is performed. Thus in write paths we can end up marking some pages or
inodes dirty even though filesystem is already frozen. This creates problems
with flusher thread hanging on frozen filesystem.

Another problem is that exclusion between ->page_mkwrite() and filesystem
freezing has been handled by setting page dirty and then verifying s_frozen.
This guaranteed that either the freezing code sees the faulted page, writes it,
and writeprotects it again or we see s_frozen set and bail out of page fault.
This works to protect from page being marked writeable while filesystem
freezing is running but has an unpleasant artefact of leaving dirty (although
unmodified and writeprotected) pages on frozen filesystem resulting in similar
problems with flusher thread as the first problem.

This patch aims at providing exclusion between write paths and filesystem
freezing. We implement a writer-freeze read-write semaphores in the superblock
for each freezing level (currently there are two - SB_FREEZE_WRITE for data and
SB_FREEZE_TRANS for metadata). Write paths which should block freezing on given
level (e.g. ->block_page_mkwrite(), ->aio_write() for SB_FREEZE_WRITE level;
transaction lifetime for SB_FREEZE_TRANS level) hold reader side of the
semaphore. Code freezing the filesystem to a given level takes the writer side.

Only that we don't really want to bounce cachelines of the semaphore between
CPUs for each write happening. So we implement the reader side of the semaphore
as a per-cpu counter and the writer side is implemented using s_frozen
superblock field.

Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/super.c         |  204 ++++++++++++++++++++++++++++++++++++++++++++++++----
 include/linux/fs.h |   41 ++++++++---
 2 files changed, 220 insertions(+), 25 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index afd0f1a..4aaad7e 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -32,12 +32,15 @@
 #include <linux/backing-dev.h>
 #include <linux/rculist_bl.h>
 #include <linux/cleancache.h>
+#include <linux/lockdep.h>
 #include "internal.h"
 
 
 LIST_HEAD(super_blocks);
 DEFINE_SPINLOCK(sb_lock);
 
+static struct lock_class_key sb_writers_key[SB_FREEZE_LEVELS-1];
+
 /*
  * One thing we have to be careful of with a per-sb shrinker is that we don't
  * drop the last active reference to the superblock from within the shrinker.
@@ -101,6 +104,24 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 	return total_objects;
 }
 
+static int init_sb_writers(struct super_block *s, int level, char *lockname)
+{
+	struct sb_writers_level *sl = &s->s_writers[level-1];
+	int err;
+
+	err = percpu_counter_init(&sl->counter, 0);
+	if (err < 0)
+		return err;
+	init_waitqueue_head(&sl->wait);
+	lockdep_init_map(&sl->lock_map, lockname, &sb_writers_key[level-1], 0);
+	return 0;
+}
+
+static void destroy_sb_writers(struct super_block *s, int level)
+{
+	percpu_counter_destroy(&s->s_writers[level-1].counter);
+}
+
 /**
  *	alloc_super	-	create new superblock
  *	@type:	filesystem type superblock should belong to
@@ -115,18 +136,19 @@ static struct super_block *alloc_super(struct file_system_type *type)
 
 	if (s) {
 		if (security_sb_alloc(s)) {
+			/*
+			 * We cannot call security_sb_free() without
+			 * security_sb_alloc() succeeding. So bail out manually
+			 */
 			kfree(s);
 			s = NULL;
 			goto out;
 		}
 #ifdef CONFIG_SMP
 		s->s_files = alloc_percpu(struct list_head);
-		if (!s->s_files) {
-			security_sb_free(s);
-			kfree(s);
-			s = NULL;
-			goto out;
-		} else {
+		if (!s->s_files)
+			goto err_out;
+		else {
 			int i;
 
 			for_each_possible_cpu(i)
@@ -135,6 +157,11 @@ static struct super_block *alloc_super(struct file_system_type *type)
 #else
 		INIT_LIST_HEAD(&s->s_files);
 #endif
+		if (init_sb_writers(s, SB_FREEZE_WRITE, "sb_writers_write"))
+			goto err_out;
+		if (init_sb_writers(s, SB_FREEZE_TRANS, "sb_writers_trans"))
+			goto err_out;
+
 		s->s_bdi = &default_backing_dev_info;
 		INIT_LIST_HEAD(&s->s_instances);
 		INIT_HLIST_BL_HEAD(&s->s_anon);
@@ -186,6 +213,17 @@ static struct super_block *alloc_super(struct file_system_type *type)
 	}
 out:
 	return s;
+err_out:
+	security_sb_free(s);
+#ifdef CONFIG_SMP
+	if (s->s_files)
+		free_percpu(s->s_files);
+#endif
+	destroy_sb_writers(s, SB_FREEZE_WRITE);
+	destroy_sb_writers(s, SB_FREEZE_TRANS);
+	kfree(s);
+	s = NULL;
+	goto out;
 }
 
 /**
@@ -1126,6 +1164,148 @@ out:
 }
 
 /**
+ * sb_end_write - drop write access to a superblock
+ * @sb: the super we wrote to
+ * @level: the lowest level of freezing which we blocked
+ *
+ * Decrement number of writers to the filesystem preventing freezing of
+ * given level. Wake up possible waiters wanting to freeze the filesystem.
+ */
+void sb_end_write(struct super_block *sb, int level)
+{
+	struct sb_writers_level *sl = &sb->s_writers[level-1];
+
+	percpu_counter_dec(&sl->counter);
+	/*
+	 * Make sure s_writers are updated before we wake up waiters in
+	 * freeze_super().
+	 */
+	smp_mb();
+	if (waitqueue_active(&sl->wait))
+		wake_up(&sl->wait);
+	rwsem_release(&sl->lock_map, 1, _RET_IP_);
+}
+EXPORT_SYMBOL(sb_end_write);
+
+/**
+ * sb_start_write - get write access to a superblock
+ * @sb: the super we write to
+ * @level: the lowest level of freezing which we block
+ *
+ * When a process wants to write data to a filesystem (i.e. dirty a page), it
+ * should embed the operation in a sb_start_write() - sb_end_write() pair to
+ * get exclusion against filesystem freezing. This function increments number
+ * of writers preventing freezing of given level to proceed.  If the file
+ * system is already frozen it waits until it is thawed.
+ *
+ * The lock orderding constraints of sb_start_write() for level SB_FREEZE_WRITE
+ * are following:
+ * mmap_sem			(page-fault)
+ *   -> s_writers		(block_page_mkwrite or equivalent)
+ *
+ * i_mutex			(do_truncate, __generic_file_aio_write)
+ *   -> s_writers
+ *
+ * s_umount			(freeze_super)
+ *   -> s_writers
+ *
+ * For level SB_FREEZE_TRANS lock constraints are rather file system dependent,
+ * in most cases equivalent to constraints for starting a fs transaction.
+ */
+void sb_start_write(struct super_block *sb, int level)
+{
+retry:
+	rwsem_acquire_read(&sb->s_writers[level-1].lock_map, 0, 0, _RET_IP_);
+	vfs_check_frozen(sb, level);
+	percpu_counter_inc(&sb->s_writers[level-1].counter);
+	/*
+	 * Make sure s_writers are updated before we check s_frozen.
+	 * freeze_super() first sets s_frozen and then checks s_writers.
+	 */
+	smp_mb();
+	if (sb->s_frozen >= level) {
+		sb_end_write(sb, level);
+		goto retry;
+	}
+}
+EXPORT_SYMBOL(sb_start_write);
+
+/**
+ * sb_dup_write - get write access to a superblock without blocking
+ * @sb: the super we write to
+ * @level: the lowest level of freezing which we block
+ *
+ * This function is like sb_start_write() only that it does not check s_frozen
+ * in the superblock. The caller can call this function only when it already
+ * holds write access to the superblock at this level (i.e., called
+ * sb_start_write(sb, level) previously).
+ */
+void sb_dup_write(struct super_block *sb, int level)
+{
+	/*
+	 * Trick lockdep into acquiring read lock again without complaining
+	 * about lock recursion
+	 */
+	rwsem_acquire_read(&sb->s_writers[level-1].lock_map, 0, 1, _RET_IP_);
+	percpu_counter_inc(&sb->s_writers[level-1].counter);
+}
+EXPORT_SYMBOL(sb_dup_write);
+
+/**
+ * sb_wait_write - wait until all writers at given level finish
+ * @sb: the super for which we wait
+ * @level: the level at which we wait for writers
+ *
+ * This function waits until there are no writers at given level. Caller
+ * of this function should make sure there can be no new writers at required
+ * level before calling this function. Otherwise this function can livelock.
+ */
+void sb_wait_write(struct super_block *sb, int level)
+{
+	s64 writers;
+	struct sb_writers_level *sl = &sb->s_writers[level-1];
+
+	do {
+		DEFINE_WAIT(wait);
+
+		/*
+		 * We use a barrier in prepare_to_wait() to separate setting
+		 * of s_frozen and checking of s_writers
+		 */
+		prepare_to_wait(&sl->wait, &wait, TASK_UNINTERRUPTIBLE);
+
+		writers = percpu_counter_sum(&sl->counter);
+		if (writers)
+			schedule();
+
+		finish_wait(&sl->wait, &wait);
+	} while (writers);
+}
+EXPORT_SYMBOL(sb_wait_write);
+
+/*
+ * Freeze superblock to given level, wait for writers at given level
+ * to finish.
+ */
+static void sb_freeze_to_level(struct super_block *sb, int level)
+{
+	sb->s_frozen = level;
+
+	/*
+	 * We just cycle-through lockdep here so that it does not complain
+	 * about returning with lock to userspace
+	 */
+	rwsem_acquire(&sb->s_writers[level-1].lock_map, 0, 0, _THIS_IP_);
+	rwsem_release(&sb->s_writers[level-1].lock_map, 1, _THIS_IP_);
+
+	/*
+	 * Now wait for writers to finish. As s_frozen is already set to
+	 * 'level' we are guaranteed there are no new writers at given level.
+	 */
+	sb_wait_write(sb, level);
+}
+
+/**
  * freeze_super - lock the filesystem and force it into a consistent state
  * @sb: the super to lock
  *
@@ -1151,21 +1331,19 @@ int freeze_super(struct super_block *sb)
 		return 0;
 	}
 
-	sb->s_frozen = SB_FREEZE_WRITE;
-	smp_wmb();
-
+	sb_freeze_to_level(sb, SB_FREEZE_WRITE);
 	sync_filesystem(sb);
-
-	sb->s_frozen = SB_FREEZE_TRANS;
-	smp_wmb();
-
+	sb_freeze_to_level(sb, SB_FREEZE_TRANS);
 	sync_blockdev(sb->s_bdev);
+
 	if (sb->s_op->freeze_fs) {
 		ret = sb->s_op->freeze_fs(sb);
 		if (ret) {
 			printk(KERN_ERR
 				"VFS:Filesystem freeze failed\n");
 			sb->s_frozen = SB_UNFROZEN;
+			smp_wmb();
+			wake_up(&sb->s_wait_unfrozen);
 			deactivate_locked_super(sb);
 			return ret;
 		}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e313022..93ce5af 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -395,6 +395,7 @@ struct inodes_stat_t {
 #include <linux/rculist_bl.h>
 #include <linux/shrinker.h>
 #include <linux/atomic.h>
+#include <linux/lockdep.h>
 
 #include <asm/byteorder.h>
 
@@ -1385,6 +1386,24 @@ extern pid_t f_getown(struct file *filp);
 extern int send_sigurg(struct fown_struct *fown);
 
 /*
+ * Snapshotting support.
+ */
+enum {
+	SB_UNFROZEN = 0,
+	SB_FREEZE_WRITE	= 1,
+	SB_FREEZE_TRANS = 2,
+	SB_FREEZE_LEVELS	/* Number of freezing states */
+};
+
+#define vfs_check_frozen(sb, level) \
+	wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))
+
+void sb_end_write(struct super_block *sb, int level);
+void sb_start_write(struct super_block *sb, int level);
+void sb_dup_write(struct super_block *sb, int level);
+void sb_wait_write(struct super_block *sb, int level);
+
+/*
  *	Umount options
  */
 
@@ -1397,6 +1416,15 @@ extern int send_sigurg(struct fown_struct *fown);
 extern struct list_head super_blocks;
 extern spinlock_t sb_lock;
 
+struct sb_writers_level {
+	struct percpu_counter	counter;	/* counter of running writes */
+	wait_queue_head_t	wait;		/* queue for waiting for
+						   writers to finish */
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map	lock_map;
+#endif
+};
+
 struct super_block {
 	struct list_head	s_list;		/* Keep this first */
 	dev_t			s_dev;		/* search index; _not_ kdev_t */
@@ -1445,6 +1473,7 @@ struct super_block {
 
 	int			s_frozen;
 	wait_queue_head_t	s_wait_unfrozen;
+	struct sb_writers_level	s_writers[SB_FREEZE_LEVELS - 1];
 
 	char s_id[32];				/* Informational name */
 	u8 s_uuid[16];				/* UUID */
@@ -1490,18 +1519,6 @@ extern void prune_dcache_sb(struct super_block *sb, int nr_to_scan);
 extern struct timespec current_fs_time(struct super_block *sb);
 
 /*
- * Snapshotting support.
- */
-enum {
-	SB_UNFROZEN = 0,
-	SB_FREEZE_WRITE	= 1,
-	SB_FREEZE_TRANS = 2,
-};
-
-#define vfs_check_frozen(sb, level) \
-	wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))
-
-/*
  * until VFS tracks user namespaces for inodes, just make all files
  * belong to init_user_ns
  */
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/8] vfs: Protect write paths by sb_start_write - sb_end_write
  2012-01-20 20:34 [PATCH 0/8] Fix filesystem freezing Jan Kara
  2012-01-20 20:34 ` [PATCH 1/8] fs: Improve filesystem freezing handling Jan Kara
@ 2012-01-20 20:34 ` Jan Kara
  2012-01-24  8:21   ` Dave Chinner
  2012-02-05  6:13   ` Eric Sandeen
  2012-01-20 20:34 ` [PATCH 3/8] ext4: Protect ext4_page_mkwrite & ext4_setattr with " Jan Kara
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 26+ messages in thread
From: Jan Kara @ 2012-01-20 20:34 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Eric Sandeen, Dave Chinner, Surbhi Palande, Kamal Mostafa,
	Christoph Hellwig, LKML, xfs, linux-ext4, Jan Kara

There are three entry points which dirty pages in a filesystem.  mmap (handled
by block_page_mkwrite()), buffered write (handled by
__generic_file_aio_write()), and truncate (it can dirty last partial page -
handled inside each filesystem separately). Protect these places with
sb_start_write() and sb_end_write().

Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/buffer.c  |   22 ++++------------------
 mm/filemap.c |    3 ++-
 2 files changed, 6 insertions(+), 19 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 19d8eb7..550714d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2338,8 +2338,8 @@ EXPORT_SYMBOL(block_commit_write);
  * beyond EOF, then the page is guaranteed safe against truncation until we
  * unlock the page.
  *
- * Direct callers of this function should call vfs_check_frozen() so that page
- * fault does not busyloop until the fs is thawed.
+ * Direct callers of this function should protect against filesystem freezing
+ * using sb_start_write() - sb_end_write() functions.
  */
 int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 			 get_block_t get_block)
@@ -2371,18 +2371,7 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 
 	if (unlikely(ret < 0))
 		goto out_unlock;
-	/*
-	 * Freezing in progress? We check after the page is marked dirty and
-	 * with page lock held so if the test here fails, we are sure freezing
-	 * code will wait during syncing until the page fault is done - at that
-	 * point page will be dirty and unlocked so freezing code will write it
-	 * and writeprotect it again.
-	 */
 	set_page_dirty(page);
-	if (inode->i_sb->s_frozen != SB_UNFROZEN) {
-		ret = -EAGAIN;
-		goto out_unlock;
-	}
 	wait_on_page_writeback(page);
 	return 0;
 out_unlock:
@@ -2397,12 +2386,9 @@ int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 	int ret;
 	struct super_block *sb = vma->vm_file->f_path.dentry->d_inode->i_sb;
 
-	/*
-	 * This check is racy but catches the common case. The check in
-	 * __block_page_mkwrite() is reliable.
-	 */
-	vfs_check_frozen(sb, SB_FREEZE_WRITE);
+	sb_start_write(sb, SB_FREEZE_WRITE);
 	ret = __block_page_mkwrite(vma, vmf, get_block);
+	sb_end_write(sb, SB_FREEZE_WRITE);
 	return block_page_mkwrite_return(ret);
 }
 EXPORT_SYMBOL(block_page_mkwrite);
diff --git a/mm/filemap.c b/mm/filemap.c
index c0018f2..471b9ae 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2529,7 +2529,7 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	count = ocount;
 	pos = *ppos;
 
-	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
 
 	/* We can write back this queue in page reclaim */
 	current->backing_dev_info = mapping->backing_dev_info;
@@ -2601,6 +2601,7 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 				pos, ppos, count, written);
 	}
 out:
+	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
 	current->backing_dev_info = NULL;
 	return written ? written : err;
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/8] ext4: Protect ext4_page_mkwrite & ext4_setattr with sb_start_write - sb_end_write
  2012-01-20 20:34 [PATCH 0/8] Fix filesystem freezing Jan Kara
  2012-01-20 20:34 ` [PATCH 1/8] fs: Improve filesystem freezing handling Jan Kara
  2012-01-20 20:34 ` [PATCH 2/8] vfs: Protect write paths by sb_start_write - sb_end_write Jan Kara
@ 2012-01-20 20:34 ` Jan Kara
  2012-01-20 20:34 ` [PATCH 4/8] xfs: Move ilock before transaction start in xfs_setattr_size() Jan Kara
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2012-01-20 20:34 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Eric Sandeen, Dave Chinner, Surbhi Palande, Kamal Mostafa,
	Christoph Hellwig, LKML, xfs, linux-ext4, Jan Kara

Since ext4_page_mkwrite() calls into __block_page_mkwrite() it has to
provide freezing protection on it's own. Also ext4_setattr() needs protection
because ext4_truncate() can modify last page of the file.

Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c |    9 ++++-----
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 848f436..05d6328 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4225,6 +4225,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 	if (error)
 		return error;
 
+	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
 	if (is_quota_modification(inode, attr))
 		dquot_initialize(inode);
 	if ((ia_valid & ATTR_UID && attr->ia_uid != inode->i_uid) ||
@@ -4327,6 +4328,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 		rc = ext4_acl_chmod(inode);
 
 err_out:
+	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
 	ext4_std_error(inode->i_sb, error);
 	if (!error)
 		error = rc;
@@ -4733,11 +4735,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	get_block_t *get_block;
 	int retries = 0;
 
-	/*
-	 * This check is racy but catches the common case. We rely on
-	 * __block_page_mkwrite() to do a reliable check.
-	 */
-	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
 	/* Delalloc case is easy... */
 	if (test_opt(inode->i_sb, DELALLOC) &&
 	    !ext4_should_journal_data(inode) &&
@@ -4805,5 +4803,6 @@ retry_alloc:
 out_ret:
 	ret = block_page_mkwrite_return(ret);
 out:
+	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
 	return ret;
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 4/8] xfs: Move ilock before transaction start in xfs_setattr_size()
  2012-01-20 20:34 [PATCH 0/8] Fix filesystem freezing Jan Kara
                   ` (2 preceding siblings ...)
  2012-01-20 20:34 ` [PATCH 3/8] ext4: Protect ext4_page_mkwrite & ext4_setattr with " Jan Kara
@ 2012-01-20 20:34 ` Jan Kara
  2012-01-24  6:59   ` Dave Chinner
  2012-01-20 20:34 ` [PATCH 5/8] xfs: Protect xfs_file_aio_write() & xfs_setattr_size() with sb_start_write - sb_end_write Jan Kara
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2012-01-20 20:34 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Eric Sandeen, Dave Chinner, Surbhi Palande, Kamal Mostafa,
	Christoph Hellwig, LKML, xfs, linux-ext4, Jan Kara, Ben Myers,
	Alex Elder

In xfs we first take ilock and start transaction afterwards. We should obey
this order in all places because otherwise we can create the following deadlock
with filesystem freezing: One process holds ilock and blocks on s_frozen ==
SB_FREEZE_TRANS in xfs_trans_alloc(), another process has a transaction started
(thus blocking freezing) and blocks on ilock. So we have to take ilock earlier
in xfs_setattr_size().

CC: Ben Myers <bpm@sgi.com>
CC: Alex Elder <elder@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/xfs/xfs_iops.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 23ce927..3579bc8 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -850,6 +850,9 @@ xfs_setattr_size(
 	if (error)
 		goto out_unlock;
 
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	lock_flags |= XFS_ILOCK_EXCL;
+
 	tp = xfs_trans_alloc(mp, XFS_TRANS_SETATTR_SIZE);
 	error = xfs_trans_reserve(tp, 0, XFS_ITRUNCATE_LOG_RES(mp), 0,
 				 XFS_TRANS_PERM_LOG_RES,
@@ -860,9 +863,6 @@ xfs_setattr_size(
 	truncate_setsize(inode, iattr->ia_size);
 
 	commit_flags = XFS_TRANS_RELEASE_LOG_RES;
-	lock_flags |= XFS_ILOCK_EXCL;
-
-	xfs_ilock(ip, XFS_ILOCK_EXCL);
 
 	xfs_trans_ijoin(tp, ip, 0);
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 5/8] xfs: Protect xfs_file_aio_write() & xfs_setattr_size() with sb_start_write - sb_end_write
  2012-01-20 20:34 [PATCH 0/8] Fix filesystem freezing Jan Kara
                   ` (3 preceding siblings ...)
  2012-01-20 20:34 ` [PATCH 4/8] xfs: Move ilock before transaction start in xfs_setattr_size() Jan Kara
@ 2012-01-20 20:34 ` Jan Kara
  2012-01-24  7:19   ` Dave Chinner
  2012-02-04  4:30   ` Eric Sandeen
  2012-01-20 20:34 ` [PATCH 6/8] xfs: Use generic writers counter instead of m_active_trans counter Jan Kara
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 26+ messages in thread
From: Jan Kara @ 2012-01-20 20:34 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Eric Sandeen, Dave Chinner, Surbhi Palande, Kamal Mostafa,
	Christoph Hellwig, LKML, xfs, linux-ext4, Jan Kara, Ben Myers,
	Alex Elder

Replace racy xfs_wait_for_freeze() check in xfs_file_aio_write() with
a reliable sb_start_write() - sb_end_write() locking. Due to lock ranking
dictated by the page fault code we have to call sb_start_write() after we
acquire ilock.

Similarly we have to protect xfs_setattr_size() because it can modify last
page of truncated file. Because ilock is dropped in xfs_setattr_size() we
have to drop and retake write access as well to avoid deadlocks.

CC: Ben Myers <bpm@sgi.com>
CC: Alex Elder <elder@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/xfs/xfs_file.c |    6 ++++--
 fs/xfs/xfs_iops.c |    6 ++++++
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 753ed9b..9efd153 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -862,9 +862,11 @@ xfs_file_dio_aio_write(
 		*iolock = XFS_IOLOCK_SHARED;
 	}
 
+	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
 	trace_xfs_file_direct_write(ip, count, iocb->ki_pos, 0);
 	ret = generic_file_direct_write(iocb, iovp,
 			&nr_segs, pos, &iocb->ki_pos, count, ocount);
+	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
 
 	/* No fallback to buffered IO on errors for XFS. */
 	ASSERT(ret < 0 || ret == count);
@@ -899,6 +901,7 @@ xfs_file_buffered_aio_write(
 	/* We can write back this queue in page reclaim */
 	current->backing_dev_info = mapping->backing_dev_info;
 
+	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
 write_retry:
 	trace_xfs_file_buffered_write(ip, count, iocb->ki_pos, 0);
 	ret = generic_file_buffered_write(iocb, iovp, nr_segs,
@@ -914,6 +917,7 @@ write_retry:
 		enospc = 1;
 		goto write_retry;
 	}
+	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
 	current->backing_dev_info = NULL;
 	return ret;
 }
@@ -945,8 +949,6 @@ xfs_file_aio_write(
 	if (ocount == 0)
 		return 0;
 
-	xfs_wait_for_freeze(ip->i_mount, SB_FREEZE_WRITE);
-
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
 		return -EIO;
 
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 3579bc8..798b9c6 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -793,6 +793,7 @@ xfs_setattr_size(
 		return xfs_setattr_nonsize(ip, iattr, 0);
 	}
 
+	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
 	/*
 	 * Make sure that the dquots are attached to the inode.
 	 */
@@ -849,10 +850,14 @@ xfs_setattr_size(
 				     xfs_get_blocks);
 	if (error)
 		goto out_unlock;
+	/* Drop the write access to avoid lock inversion with ilock */
+	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	lock_flags |= XFS_ILOCK_EXCL;
 
+	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
+
 	tp = xfs_trans_alloc(mp, XFS_TRANS_SETATTR_SIZE);
 	error = xfs_trans_reserve(tp, 0, XFS_ITRUNCATE_LOG_RES(mp), 0,
 				 XFS_TRANS_PERM_LOG_RES,
@@ -924,6 +929,7 @@ xfs_setattr_size(
 
 	error = xfs_trans_commit(tp, XFS_TRANS_RELEASE_LOG_RES);
 out_unlock:
+	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
 	if (lock_flags)
 		xfs_iunlock(ip, lock_flags);
 	return error;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 6/8] xfs: Use generic writers counter instead of m_active_trans counter
  2012-01-20 20:34 [PATCH 0/8] Fix filesystem freezing Jan Kara
                   ` (4 preceding siblings ...)
  2012-01-20 20:34 ` [PATCH 5/8] xfs: Protect xfs_file_aio_write() & xfs_setattr_size() with sb_start_write - sb_end_write Jan Kara
@ 2012-01-20 20:34 ` Jan Kara
  2012-01-24  8:05   ` Dave Chinner
                     ` (2 more replies)
  2012-01-20 20:34 ` [PATCH 7/8] Documentation: Correct s_umount state for freeze_fs/unfreeze_fs Jan Kara
  2012-01-20 20:34 ` [PATCH 8/8] vfs: Document s_frozen state through freeze_super Jan Kara
  7 siblings, 3 replies; 26+ messages in thread
From: Jan Kara @ 2012-01-20 20:34 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Eric Sandeen, Dave Chinner, Surbhi Palande, Kamal Mostafa,
	Christoph Hellwig, LKML, xfs, linux-ext4, Jan Kara

m_active_trans counter is racy wrt filesystem freezing. The patch replaces it
with generic counter of running transactions which is properly synchronized
with filesystem freezing. Things are a bit more complex because we need to log
a dummy transaction and free block counters after the filesystem is frozen so
we need to pass information to _xfs_trans_alloc() whether the transaction is
part of filesystem freezing or not.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/xfs/xfs_fsops.c |    5 +++--
 fs/xfs/xfs_fsops.h |    2 +-
 fs/xfs/xfs_iomap.c |    4 ++--
 fs/xfs/xfs_mount.c |    2 +-
 fs/xfs/xfs_mount.h |    2 --
 fs/xfs/xfs_super.c |    3 +--
 fs/xfs/xfs_sync.c  |   13 +++----------
 fs/xfs/xfs_trans.c |   19 ++++++++++++-------
 fs/xfs/xfs_trans.h |    3 ++-
 9 files changed, 25 insertions(+), 28 deletions(-)

diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 1c6fdeb..503fdfa 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -645,12 +645,13 @@ out:
  */
 int
 xfs_fs_log_dummy(
-	xfs_mount_t	*mp)
+	xfs_mount_t	*mp,
+	bool		for_freeze)
 {
 	xfs_trans_t	*tp;
 	int		error;
 
-	tp = _xfs_trans_alloc(mp, XFS_TRANS_DUMMY1, KM_SLEEP);
+	tp = _xfs_trans_alloc(mp, XFS_TRANS_DUMMY1, KM_SLEEP, for_freeze);
 	error = xfs_trans_reserve(tp, 0, mp->m_sb.sb_sectsize + 128, 0, 0,
 					XFS_DEFAULT_LOG_COUNT);
 	if (error) {
diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h
index 1b6a98b..4d6253e 100644
--- a/fs/xfs/xfs_fsops.h
+++ b/fs/xfs/xfs_fsops.h
@@ -25,6 +25,6 @@ extern int xfs_fs_counts(xfs_mount_t *mp, xfs_fsop_counts_t *cnt);
 extern int xfs_reserve_blocks(xfs_mount_t *mp, __uint64_t *inval,
 				xfs_fsop_resblks_t *outval);
 extern int xfs_fs_goingdown(xfs_mount_t *mp, __uint32_t inflags);
-extern int xfs_fs_log_dummy(struct xfs_mount *mp);
+extern int xfs_fs_log_dummy(struct xfs_mount *mp, bool for_freeze);
 
 #endif	/* __XFS_FSOPS_H__ */
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 9afa282..fd47f6e 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -679,8 +679,8 @@ xfs_iomap_write_unwritten(
 		 * the same inode that we complete here and might deadlock
 		 * on the iolock.
 		 */
-		xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
-		tp = _xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE, KM_NOFS);
+		tp = _xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE,
+				KM_NOFS, false);
 		tp->t_flags |= XFS_TRANS_RESERVE;
 		error = xfs_trans_reserve(tp, resblks,
 				XFS_WRITE_LOG_RES(mp), 0,
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index d06afbc..74a93c9 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1572,7 +1572,7 @@ xfs_log_sbcount(xfs_mount_t *mp)
 	if (!xfs_sb_version_haslazysbcount(&mp->m_sb))
 		return 0;
 
-	tp = _xfs_trans_alloc(mp, XFS_TRANS_SB_COUNT, KM_SLEEP);
+	tp = _xfs_trans_alloc(mp, XFS_TRANS_SB_COUNT, KM_SLEEP, true);
 	error = xfs_trans_reserve(tp, 0, mp->m_sb.sb_sectsize + 128, 0, 0,
 					XFS_DEFAULT_LOG_COUNT);
 	if (error) {
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index bb24dac..1317018 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -195,7 +195,6 @@ typedef struct xfs_mount {
 	uint			m_chsize;	/* size of next field */
 	struct xfs_chash	*m_chash;	/* fs private inode per-cluster
 						 * hash table */
-	atomic_t		m_active_trans;	/* number trans frozen */
 #ifdef HAVE_PERCPU_SB
 	xfs_icsb_cnts_t __percpu *m_sb_cnts;	/* per-cpu superblock counters */
 	unsigned long		m_icsb_counters; /* disabled per-cpu counters */
@@ -312,7 +311,6 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int flags, char *fname,
 #define SHUTDOWN_DEVICE_REQ	0x0020	/* failed all paths to the device */
 
 #define xfs_test_for_freeze(mp)		((mp)->m_super->s_frozen)
-#define xfs_wait_for_freeze(mp,l)	vfs_check_frozen((mp)->m_super, (l))
 
 /*
  * Flags for xfs_mountfs
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 3eca58f..2c333be 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1242,7 +1242,7 @@ xfs_fs_freeze(
 
 	xfs_save_resvblks(mp);
 	xfs_quiesce_attr(mp);
-	return -xfs_fs_log_dummy(mp);
+	return -xfs_fs_log_dummy(mp, true);
 }
 
 STATIC int
@@ -1329,7 +1329,6 @@ xfs_fs_fill_super(
 
 	spin_lock_init(&mp->m_sb_lock);
 	mutex_init(&mp->m_growlock);
-	atomic_set(&mp->m_active_trans, 0);
 
 	mp->m_super = sb;
 	sb->s_fs_info = mp;
diff --git a/fs/xfs/xfs_sync.c b/fs/xfs/xfs_sync.c
index aa3dc1a..24f4d7c 100644
--- a/fs/xfs/xfs_sync.c
+++ b/fs/xfs/xfs_sync.c
@@ -373,7 +373,7 @@ xfs_quiesce_data(
 
 	/* mark the log as covered if needed */
 	if (xfs_log_need_covered(mp))
-		error2 = xfs_fs_log_dummy(mp);
+		error2 = xfs_fs_log_dummy(mp, false);
 
 	/* flush data-only devices */
 	if (mp->m_rtdev_targp)
@@ -421,18 +421,11 @@ xfs_quiesce_attr(
 	int	error = 0;
 
 	/* wait for all modifications to complete */
-	while (atomic_read(&mp->m_active_trans) > 0)
-		delay(100);
+	sb_wait_write(mp->m_super, SB_FREEZE_TRANS);
 
 	/* flush inodes and push all remaining buffers out to disk */
 	xfs_quiesce_fs(mp);
 
-	/*
-	 * Just warn here till VFS can correctly support
-	 * read-only remount without racing.
-	 */
-	WARN_ON(atomic_read(&mp->m_active_trans) != 0);
-
 	/* Push the superblock and write an unmount record */
 	error = xfs_log_sbcount(mp);
 	if (error)
@@ -467,7 +460,7 @@ xfs_sync_worker(
 		/* dgc: errors ignored here */
 		if (mp->m_super->s_frozen == SB_UNFROZEN &&
 		    xfs_log_need_covered(mp))
-			error = xfs_fs_log_dummy(mp);
+			error = xfs_fs_log_dummy(mp, false);
 		else
 			xfs_log_force(mp, 0);
 		error = xfs_qm_sync(mp, SYNC_TRYLOCK);
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 1f35b2f..e97014b 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -577,24 +577,28 @@ xfs_trans_alloc(
 	xfs_mount_t	*mp,
 	uint		type)
 {
-	xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
-	return _xfs_trans_alloc(mp, type, KM_SLEEP);
+	return _xfs_trans_alloc(mp, type, KM_SLEEP, false);
 }
 
 xfs_trans_t *
 _xfs_trans_alloc(
 	xfs_mount_t	*mp,
 	uint		type,
-	uint		memflags)
+	uint		memflags,
+	bool		freezing)
 {
 	xfs_trans_t	*tp;
 
-	atomic_inc(&mp->m_active_trans);
-
+	if (!freezing)
+		sb_start_write(mp->m_super, SB_FREEZE_TRANS);
+	else
+		WARN_ON(xfs_test_for_freeze(mp) != SB_FREEZE_TRANS);
 	tp = kmem_zone_zalloc(xfs_trans_zone, memflags);
 	tp->t_magic = XFS_TRANS_MAGIC;
 	tp->t_type = type;
 	tp->t_mountp = mp;
+	if (freezing)
+		tp->t_flags |= XFS_TRANS_FREEZING;
 	INIT_LIST_HEAD(&tp->t_items);
 	INIT_LIST_HEAD(&tp->t_busy);
 	return tp;
@@ -611,7 +615,8 @@ xfs_trans_free(
 	xfs_alloc_busy_sort(&tp->t_busy);
 	xfs_alloc_busy_clear(tp->t_mountp, &tp->t_busy, false);
 
-	atomic_dec(&tp->t_mountp->m_active_trans);
+	if (!(tp->t_flags & XFS_TRANS_FREEZING))
+		sb_end_write(tp->t_mountp->m_super, SB_FREEZE_TRANS);
 	xfs_trans_free_dqinfo(tp);
 	kmem_zone_free(xfs_trans_zone, tp);
 }
@@ -654,7 +659,7 @@ xfs_trans_dup(
 
 	xfs_trans_dup_dqinfo(tp, ntp);
 
-	atomic_inc(&tp->t_mountp->m_active_trans);
+	sb_dup_write(tp->t_mountp->m_super, SB_FREEZE_TRANS);
 	return ntp;
 }
 
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 3ae713c..8a04d8e 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -180,6 +180,7 @@ struct xfs_log_item_desc {
 #define	XFS_TRANS_SYNC		0x08	/* make commit synchronous */
 #define XFS_TRANS_DQ_DIRTY	0x10	/* at least one dquot in trx dirty */
 #define XFS_TRANS_RESERVE	0x20    /* OK to use reserved data blocks */
+#define XFS_TRANS_FREEZING	0x40	/* can happen on frozen filesystem */
 
 /*
  * Values for call flags parameter.
@@ -448,7 +449,7 @@ typedef struct xfs_trans {
  * XFS transaction mechanism exported interfaces.
  */
 xfs_trans_t	*xfs_trans_alloc(struct xfs_mount *, uint);
-xfs_trans_t	*_xfs_trans_alloc(struct xfs_mount *, uint, uint);
+xfs_trans_t	*_xfs_trans_alloc(struct xfs_mount *, uint, uint, bool);
 xfs_trans_t	*xfs_trans_dup(xfs_trans_t *);
 int		xfs_trans_reserve(xfs_trans_t *, uint, uint, uint,
 				  uint, uint);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 7/8] Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
  2012-01-20 20:34 [PATCH 0/8] Fix filesystem freezing Jan Kara
                   ` (5 preceding siblings ...)
  2012-01-20 20:34 ` [PATCH 6/8] xfs: Use generic writers counter instead of m_active_trans counter Jan Kara
@ 2012-01-20 20:34 ` Jan Kara
  2012-01-20 20:34 ` [PATCH 8/8] vfs: Document s_frozen state through freeze_super Jan Kara
  7 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2012-01-20 20:34 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Eric Sandeen, Dave Chinner, Surbhi Palande, Kamal Mostafa,
	Christoph Hellwig, LKML, xfs, linux-ext4, Valerie Aurora,
	Jan Kara

From: Valerie Aurora <val@vaaconsulting.com>

freeze_fs/unfreeze_fs ops are called with s_umount held for write, not read.

Signed-off-by: Valerie Aurora <val@vaaconsulting.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 Documentation/filesystems/Locking |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index d819ba1..7e46a94 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -134,8 +134,8 @@ evict_inode:
 put_super:		write
 write_super:		read
 sync_fs:		read
-freeze_fs:		read
-unfreeze_fs:		read
+freeze_fs:		write
+unfreeze_fs:		write
 statfs:			maybe(read)	(see below)
 remount_fs:		write
 umount_begin:		no
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 8/8] vfs: Document s_frozen state through freeze_super
  2012-01-20 20:34 [PATCH 0/8] Fix filesystem freezing Jan Kara
                   ` (6 preceding siblings ...)
  2012-01-20 20:34 ` [PATCH 7/8] Documentation: Correct s_umount state for freeze_fs/unfreeze_fs Jan Kara
@ 2012-01-20 20:34 ` Jan Kara
  7 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2012-01-20 20:34 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Eric Sandeen, Dave Chinner, Surbhi Palande, Kamal Mostafa,
	Christoph Hellwig, LKML, xfs, linux-ext4, Valerie Aurora,
	Jan Kara

From: Valerie Aurora <val@vaaconsulting.com>

[kamal@canonical.com: patch restructure]
[jack@suse.cz: updated comments to reflect new situation]

Signed-off-by: Valerie Aurora <val@vaaconsulting.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/super.c |   18 ++++++++++++++++++
 1 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 4aaad7e..557a4ef 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1312,6 +1312,24 @@ static void sb_freeze_to_level(struct super_block *sb, int level)
  * Syncs the super to make sure the filesystem is consistent and calls the fs's
  * freeze_fs.  Subsequent calls to this without first thawing the fs will return
  * -EBUSY.
+ *
+ * During this function, sb->s_frozen goes through these values:
+ *
+ * SB_UNFROZEN: File system is normal, all writes progress as usual.
+ *
+ * SB_FREEZE_WRITE: The file system is in the process of being frozen.
+ * New writes should be blocked, we wait for in-progress writes (using
+ * s_writers[SB_FREEZE_WRITE-1]) and then sync the file system.
+ *
+ * SB_FREEZE_TRANS: The file system is frozen. We wait for all metadata writes
+ * to finish using s_writers[SB_FREEZE_TRANS-1] and after that call ->freeze_fs
+ * to finish filesystem freezing. After ->freeze_fs is done nothing should be
+ * dirty and all modifications are blocked until the file system is thawed.
+ *
+ * sb->s_frozen is protected by sb->s_umount.  Additionally,
+ * SB_FREEZE_WRITE is only temporarily set during freeze/thaw while
+ * holding sb->s_umount for writing, so any other callers holding
+ * sb->s_umount will never see this state.
  */
 int freeze_super(struct super_block *sb)
 {
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/8] xfs: Move ilock before transaction start in xfs_setattr_size()
  2012-01-20 20:34 ` [PATCH 4/8] xfs: Move ilock before transaction start in xfs_setattr_size() Jan Kara
@ 2012-01-24  6:59   ` Dave Chinner
  2012-01-24 11:52     ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2012-01-24  6:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Eric Sandeen, Dave Chinner, Surbhi Palande,
	Kamal Mostafa, Christoph Hellwig, LKML, xfs, linux-ext4,
	Ben Myers, Alex Elder

On Fri, Jan 20, 2012 at 09:34:42PM +0100, Jan Kara wrote:
> In xfs we first take ilock and start transaction afterwards.

The correct order is to allocate the transaction, reserve the space
for it and then take the ilock.  We cannot hold the ilock over the
transaction reservation because that can deadlock the journal.

That is, to make space for the new transaction reservation, we may
need to take the ilock to flush the inode and allow the journal tail
to move forwards to make space for the new transaction.  If we
already hold the ilock, then it can't be flushed, we can't make
space available in the journal and hence deadlock.

Maybe you confused the ilock vs the iolock.  We can hold the iolock
over the trans alloc/reserve because that lock is not required to
move the tail of the journal, so the deadlock doesn't exist.

> We should obey
> this order in all places because otherwise we can create the following deadlock
> with filesystem freezing: One process holds ilock and blocks on s_frozen ==
> SB_FREEZE_TRANS in xfs_trans_alloc(), another process has a transaction started
> (thus blocking freezing) and blocks on ilock. So we have to take ilock earlier
> in xfs_setattr_size().

Where are we taking the ilock and then calling xfs_trans_alloc()?
That's the caller needs to be fixed, not the 40-odd that do the
right thing by taking the ilock -after- the trans alloc/reserve
calls.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/8] xfs: Protect xfs_file_aio_write() & xfs_setattr_size() with sb_start_write - sb_end_write
  2012-01-20 20:34 ` [PATCH 5/8] xfs: Protect xfs_file_aio_write() & xfs_setattr_size() with sb_start_write - sb_end_write Jan Kara
@ 2012-01-24  7:19   ` Dave Chinner
  2012-01-24 19:35     ` Jan Kara
  2012-02-04  4:30   ` Eric Sandeen
  1 sibling, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2012-01-24  7:19 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Eric Sandeen, Dave Chinner, Surbhi Palande,
	Kamal Mostafa, Christoph Hellwig, LKML, xfs, linux-ext4,
	Ben Myers, Alex Elder

On Fri, Jan 20, 2012 at 09:34:43PM +0100, Jan Kara wrote:
> Replace racy xfs_wait_for_freeze() check in xfs_file_aio_write() with
> a reliable sb_start_write() - sb_end_write() locking. Due to lock ranking
> dictated by the page fault code we have to call sb_start_write() after we
> acquire ilock.

It appears to me that you have indeed confused the ilock with the
iolock.

> Similarly we have to protect xfs_setattr_size() because it can modify last
> page of truncated file. Because ilock is dropped in xfs_setattr_size() we
> have to drop and retake write access as well to avoid deadlocks.

> 
> CC: Ben Myers <bpm@sgi.com>
> CC: Alex Elder <elder@kernel.org>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/xfs/xfs_file.c |    6 ++++--
>  fs/xfs/xfs_iops.c |    6 ++++++
>  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 753ed9b..9efd153 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -862,9 +862,11 @@ xfs_file_dio_aio_write(
>  		*iolock = XFS_IOLOCK_SHARED;
>  	}
>  
> +	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
>  	trace_xfs_file_direct_write(ip, count, iocb->ki_pos, 0);
>  	ret = generic_file_direct_write(iocb, iovp,
>  			&nr_segs, pos, &iocb->ki_pos, count, ocount);
> +	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);

That's inside the iolock, not the ilock. Either way, it is
incorrect. This accounting should be outside the iolock - because
xfs_trans_alloc() can be called with the iolock held. Therefore the
freeze/lock order needs to be

	sb_start_write(SB_FREEZE_WRITE)
	  XFS(ip)->i_iolock
	    XFS(ip)->i_ilock
	sb_end_write(SB_FREEZE_WRITE)

Which matches the current freeze/lock order.

> @@ -945,8 +949,6 @@ xfs_file_aio_write(
>  	if (ocount == 0)
>  		return 0;
>  
> -	xfs_wait_for_freeze(ip->i_mount, SB_FREEZE_WRITE);
> -

that's where sb_start_write() needs to be, and the sb-end_write()
call needs to below the generic_write_sync() calls that will trigger
IO on O_SYNC writes. Otherwise it is not covering all the IO path
correctly.

>  	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
>  		return -EIO;
>  
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 3579bc8..798b9c6 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -793,6 +793,7 @@ xfs_setattr_size(
>  		return xfs_setattr_nonsize(ip, iattr, 0);
>  	}
>  
> +	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
>  	/*
>  	 * Make sure that the dquots are attached to the inode.
>  	 */
> @@ -849,10 +850,14 @@ xfs_setattr_size(
>  				     xfs_get_blocks);
>  	if (error)
>  		goto out_unlock;
> +	/* Drop the write access to avoid lock inversion with ilock */
> +	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
>  
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
>  	lock_flags |= XFS_ILOCK_EXCL;
>  
> +	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
> +

This is caused by the previous problems I pointed out. You should
not need to drop the freeze reference here at all.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/8] xfs: Use generic writers counter instead of m_active_trans counter
  2012-01-20 20:34 ` [PATCH 6/8] xfs: Use generic writers counter instead of m_active_trans counter Jan Kara
@ 2012-01-24  8:05   ` Dave Chinner
  2012-02-04  2:13     ` Eric Sandeen
  2012-02-04  2:42   ` Eric Sandeen
  2012-02-04  4:34   ` Eric Sandeen
  2 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2012-01-24  8:05 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Eric Sandeen, Dave Chinner, Surbhi Palande,
	Kamal Mostafa, Christoph Hellwig, LKML, xfs, linux-ext4

On Fri, Jan 20, 2012 at 09:34:44PM +0100, Jan Kara wrote:
> m_active_trans counter is racy wrt filesystem freezing. The patch replaces it
> with generic counter of running transactions which is properly synchronized
> with filesystem freezing. Things are a bit more complex because we need to log
> a dummy transaction and free block counters after the filesystem is frozen so
> we need to pass information to _xfs_trans_alloc() whether the transaction is
> part of filesystem freezing or not.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/xfs/xfs_fsops.c |    5 +++--
>  fs/xfs/xfs_fsops.h |    2 +-
>  fs/xfs/xfs_iomap.c |    4 ++--
>  fs/xfs/xfs_mount.c |    2 +-
>  fs/xfs/xfs_mount.h |    2 --
>  fs/xfs/xfs_super.c |    3 +--
>  fs/xfs/xfs_sync.c  |   13 +++----------
>  fs/xfs/xfs_trans.c |   19 ++++++++++++-------
>  fs/xfs/xfs_trans.h |    3 ++-
>  9 files changed, 25 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 1c6fdeb..503fdfa 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -645,12 +645,13 @@ out:
>   */
>  int
>  xfs_fs_log_dummy(
> -	xfs_mount_t	*mp)
> +	xfs_mount_t	*mp,
> +	bool		for_freeze)

What does "for_freeze" mean? If it is true, does it mean we are in a
freeze or not in a freeze? I can't really tell from the code,
because it just passed true or false, and in one case the code
always passes false even though the code can be called after
SB_FREEZE_WRITE is set (xfs_quiesce_data() via sync_filesystem())

>  #endif	/* __XFS_FSOPS_H__ */
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 9afa282..fd47f6e 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -679,8 +679,8 @@ xfs_iomap_write_unwritten(
>  		 * the same inode that we complete here and might deadlock
>  		 * on the iolock.
>  		 */
> -		xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> -		tp = _xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE, KM_NOFS);
> +		tp = _xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE,
> +				KM_NOFS, false);

This is a documentation regression - the code was clearly self
documenting w.r.t. frozen filesystem behaviour. It isn't anymore.

I'd suggest that we need:

#define XFS_WAIT_FOR_FREEZE	false
#define XFS_IGNORE_FROZEN_SB	true

as the parameters here to makeit extremely clear when reading the
code exactly what that last parameter means. i.e. it is self
documenting. That will help clear up a lot of the confusion on what
these magic boolean parameters are supposed to mean....

> @@ -312,7 +311,6 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int flags, char *fname,
>  #define SHUTDOWN_DEVICE_REQ	0x0020	/* failed all paths to the device */
>  
>  #define xfs_test_for_freeze(mp)		((mp)->m_super->s_frozen)

I'd remove this, too, and just open code it.

> diff --git a/fs/xfs/xfs_sync.c b/fs/xfs/xfs_sync.c
> index aa3dc1a..24f4d7c 100644
> --- a/fs/xfs/xfs_sync.c
> +++ b/fs/xfs/xfs_sync.c
> @@ -373,7 +373,7 @@ xfs_quiesce_data(
>  
>  	/* mark the log as covered if needed */
>  	if (xfs_log_need_covered(mp))
> -		error2 = xfs_fs_log_dummy(mp);
> +		error2 = xfs_fs_log_dummy(mp, false);

This is the call that can occur inside SB_FREEZE_WRITE context as
well as outside it.

>  
>  	/* flush data-only devices */
>  	if (mp->m_rtdev_targp)
> @@ -421,18 +421,11 @@ xfs_quiesce_attr(
>  	int	error = 0;
>  
>  	/* wait for all modifications to complete */
> -	while (atomic_read(&mp->m_active_trans) > 0)
> -		delay(100);
> +	sb_wait_write(mp->m_super, SB_FREEZE_TRANS);
>  
>  	/* flush inodes and push all remaining buffers out to disk */
>  	xfs_quiesce_fs(mp);
>  
> -	/*
> -	 * Just warn here till VFS can correctly support
> -	 * read-only remount without racing.
> -	 */
> -	WARN_ON(atomic_read(&mp->m_active_trans) != 0);
> -

Now there's an interesting question. Does this break read-only
remount?

/me checks the sb_wait_write() code

No, it looks like it should be fine.

>  	/* Push the superblock and write an unmount record */
>  	error = xfs_log_sbcount(mp);
>  	if (error)
> @@ -467,7 +460,7 @@ xfs_sync_worker(
>  		/* dgc: errors ignored here */
>  		if (mp->m_super->s_frozen == SB_UNFROZEN &&
>  		    xfs_log_need_covered(mp))
> -			error = xfs_fs_log_dummy(mp);
> +			error = xfs_fs_log_dummy(mp, false);
>  		else
>  			xfs_log_force(mp, 0);
>  		error = xfs_qm_sync(mp, SYNC_TRYLOCK);
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 1f35b2f..e97014b 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -577,24 +577,28 @@ xfs_trans_alloc(
>  	xfs_mount_t	*mp,
>  	uint		type)
>  {
> -	xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> -	return _xfs_trans_alloc(mp, type, KM_SLEEP);
> +	return _xfs_trans_alloc(mp, type, KM_SLEEP, false);
>  }
>  
>  xfs_trans_t *
>  _xfs_trans_alloc(
>  	xfs_mount_t	*mp,
>  	uint		type,
> -	uint		memflags)
> +	uint		memflags,
> +	bool		freezing)
>  {
>  	xfs_trans_t	*tp;
>  
> -	atomic_inc(&mp->m_active_trans);
> -
> +	if (!freezing)
> +		sb_start_write(mp->m_super, SB_FREEZE_TRANS);
> +	else
> +		WARN_ON(xfs_test_for_freeze(mp) != SB_FREEZE_TRANS);

Just open code xfs_test_for_freeze() and add a line of whitespace
after this.

>  	tp = kmem_zone_zalloc(xfs_trans_zone, memflags);
>  	tp->t_magic = XFS_TRANS_MAGIC;
>  	tp->t_type = type;
>  	tp->t_mountp = mp;
> +	if (freezing)
> +		tp->t_flags |= XFS_TRANS_FREEZING;

Simply assign the value - tp->t_flags is guaranteed to be 0 right
now.

>  	INIT_LIST_HEAD(&tp->t_items);
>  	INIT_LIST_HEAD(&tp->t_busy);
>  	return tp;
> @@ -611,7 +615,8 @@ xfs_trans_free(
>  	xfs_alloc_busy_sort(&tp->t_busy);
>  	xfs_alloc_busy_clear(tp->t_mountp, &tp->t_busy, false);
>  
> -	atomic_dec(&tp->t_mountp->m_active_trans);
> +	if (!(tp->t_flags & XFS_TRANS_FREEZING))
> +		sb_end_write(tp->t_mountp->m_super, SB_FREEZE_TRANS);
>  	xfs_trans_free_dqinfo(tp);
>  	kmem_zone_free(xfs_trans_zone, tp);
>  }
> @@ -654,7 +659,7 @@ xfs_trans_dup(
>  
>  	xfs_trans_dup_dqinfo(tp, ntp);
>  
> -	atomic_inc(&tp->t_mountp->m_active_trans);
> +	sb_dup_write(tp->t_mountp->m_super, SB_FREEZE_TRANS);

That's strangly named. Isn't the normal thing to do here use a "__"
prefix for operations that just need an extra reference because they
already have one (i.e. __sb_start_write())?

This also looks broken with repsect to the new XFS_TRANS_FREEZING
flag. If it is set on the parent, it needs to be set on the
duplicated transaction. And if it is set, then no extra reference
should be taken.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/8] vfs: Protect write paths by sb_start_write - sb_end_write
  2012-01-20 20:34 ` [PATCH 2/8] vfs: Protect write paths by sb_start_write - sb_end_write Jan Kara
@ 2012-01-24  8:21   ` Dave Chinner
  2012-01-24 11:44     ` Jan Kara
  2012-02-05  6:13   ` Eric Sandeen
  1 sibling, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2012-01-24  8:21 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Eric Sandeen, Dave Chinner, Surbhi Palande,
	Kamal Mostafa, Christoph Hellwig, LKML, xfs, linux-ext4

On Fri, Jan 20, 2012 at 09:34:40PM +0100, Jan Kara wrote:
> There are three entry points which dirty pages in a filesystem.  mmap (handled
> by block_page_mkwrite()), buffered write (handled by
> __generic_file_aio_write()), and truncate (it can dirty last partial page -
> handled inside each filesystem separately). Protect these places with
> sb_start_write() and sb_end_write().

fallocate can also dirty pages, either during preallocation or hole
punching.  Hence if you are going to promote truncate to
SB_FREEZE_WRITE protection then you need to promote everything else
that can zero partial blocks as well.

That also means that anything the has implemented XFS_IOC_ ioctl
interfaces for prellocation and hole punching (xfs, ocfs2 and gfs2
IIRC) also needs to be protected in the same way.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/8] vfs: Protect write paths by sb_start_write - sb_end_write
  2012-01-24  8:21   ` Dave Chinner
@ 2012-01-24 11:44     ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2012-01-24 11:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-fsdevel, Eric Sandeen, Dave Chinner,
	Surbhi Palande, Kamal Mostafa, Christoph Hellwig, LKML, xfs,
	linux-ext4

On Tue 24-01-12 19:21:19, Dave Chinner wrote:
> On Fri, Jan 20, 2012 at 09:34:40PM +0100, Jan Kara wrote:
> > There are three entry points which dirty pages in a filesystem.  mmap (handled
> > by block_page_mkwrite()), buffered write (handled by
> > __generic_file_aio_write()), and truncate (it can dirty last partial page -
> > handled inside each filesystem separately). Protect these places with
> > sb_start_write() and sb_end_write().
> 
> fallocate can also dirty pages, either during preallocation or hole
> punching.  Hence if you are going to promote truncate to
> SB_FREEZE_WRITE protection then you need to promote everything else
> that can zero partial blocks as well.
> 
> That also means that anything the has implemented XFS_IOC_ ioctl
> interfaces for prellocation and hole punching (xfs, ocfs2 and gfs2
> IIRC) also needs to be protected in the same way.
  Yeah, you are right. As I wrote in the introductory mail, there's problem
with metadata operations (e.g. directory modifications) anyway so we'll
have to audit all places where transactions are started. First I'll do this
for ext4 as a POC and then I'll try to do that for XFS if Eric doesn't beat
me to it (he promised to have a look at XFS part ;).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/8] xfs: Move ilock before transaction start in xfs_setattr_size()
  2012-01-24  6:59   ` Dave Chinner
@ 2012-01-24 11:52     ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2012-01-24 11:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-fsdevel, Eric Sandeen, Dave Chinner,
	Surbhi Palande, Kamal Mostafa, Christoph Hellwig, LKML, xfs,
	linux-ext4, Ben Myers, Alex Elder

On Tue 24-01-12 17:59:45, Dave Chinner wrote:
> On Fri, Jan 20, 2012 at 09:34:42PM +0100, Jan Kara wrote:
> > In xfs we first take ilock and start transaction afterwards.
> 
> The correct order is to allocate the transaction, reserve the space
> for it and then take the ilock.  We cannot hold the ilock over the
> transaction reservation because that can deadlock the journal.
>
> That is, to make space for the new transaction reservation, we may
> need to take the ilock to flush the inode and allow the journal tail
> to move forwards to make space for the new transaction.  If we
> already hold the ilock, then it can't be flushed, we can't make
> space available in the journal and hence deadlock.
  Thanks for clarification!

> Maybe you confused the ilock vs the iolock.  We can hold the iolock
> over the trans alloc/reserve because that lock is not required to
> move the tail of the journal, so the deadlock doesn't exist.
  Ups! I now had a look at what xfs_rw_ilock() does. I always thought it's
just a plain rw semaphore and now I see it takes several locks depending on
the argument. Ugh, a bit surprising for XFS newcomer as me ;) But now
things become clearer so I fix my patches with this new knowledge in mind.
So just disregard my locking comments. They were likely bogus.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/8] xfs: Protect xfs_file_aio_write() & xfs_setattr_size() with sb_start_write - sb_end_write
  2012-01-24  7:19   ` Dave Chinner
@ 2012-01-24 19:35     ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2012-01-24 19:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-fsdevel, Eric Sandeen, Dave Chinner,
	Surbhi Palande, Kamal Mostafa, Christoph Hellwig, LKML, xfs,
	linux-ext4, Ben Myers, Alex Elder

On Tue 24-01-12 18:19:26, Dave Chinner wrote:
> On Fri, Jan 20, 2012 at 09:34:43PM +0100, Jan Kara wrote:
> > Replace racy xfs_wait_for_freeze() check in xfs_file_aio_write() with
> > a reliable sb_start_write() - sb_end_write() locking. Due to lock ranking
> > dictated by the page fault code we have to call sb_start_write() after we
> > acquire ilock.
> 
> It appears to me that you have indeed confused the ilock with the
> iolock.
> 
> > Similarly we have to protect xfs_setattr_size() because it can modify last
> > page of truncated file. Because ilock is dropped in xfs_setattr_size() we
> > have to drop and retake write access as well to avoid deadlocks.
> 
> > 
> > CC: Ben Myers <bpm@sgi.com>
> > CC: Alex Elder <elder@kernel.org>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/xfs/xfs_file.c |    6 ++++--
> >  fs/xfs/xfs_iops.c |    6 ++++++
> >  2 files changed, 10 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 753ed9b..9efd153 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -862,9 +862,11 @@ xfs_file_dio_aio_write(
> >  		*iolock = XFS_IOLOCK_SHARED;
> >  	}
> >  
> > +	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
> >  	trace_xfs_file_direct_write(ip, count, iocb->ki_pos, 0);
> >  	ret = generic_file_direct_write(iocb, iovp,
> >  			&nr_segs, pos, &iocb->ki_pos, count, ocount);
> > +	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
> 
> That's inside the iolock, not the ilock. Either way, it is
> incorrect. This accounting should be outside the iolock - because
> xfs_trans_alloc() can be called with the iolock held. Therefore the
> freeze/lock order needs to be
> 
> 	sb_start_write(SB_FREEZE_WRITE)
> 	  XFS(ip)->i_iolock
> 	    XFS(ip)->i_ilock
> 	sb_end_write(SB_FREEZE_WRITE)
> 
> Which matches the current freeze/lock order.
  Hmm, so I was looking at this and I think there are following locking
constrants (please correct me if I have something wrong):
iolock -> trans start (per your claim above)
trans start -> ilock (ditto)
iolock -> mmap_sem (aio write holds iolock and copying data from userspace
  might need mmap sem if it hits page fault)
mmap_sem -> ilock (do_wp_page -> block_page_mkwrite -> __xfs_get_blocks)
freezing -> trans start (so that we can clean the filesystem during
              freezing)

So I see two choices here.
  1) Put 'freezing' above iolock as you suggest. But then handling the page
fault path becomes challenging. We cannot block there easily because we are
called with mmap_sem held. I just talked with Mel and it seems that
dropping mmap_sem in ->page_mkwrite(), blocking, retaking mmap_sem and
returning VM_FAULT_RETRY might work but we'll see whether some other mm guy
won't kill me for that ;).
  2) Put 'freezing' below mmap_sem. That would put it below iolock/i_mutex
as well. Then handling page fault is easy. We could not block in ->aio_write
but we'd have to block in ->write_begin() instead. Similarly we would have
to block in other write paths.

The first approach has the advantage that we could put lots of frozen
checks into VFS thus making them shared among filesystems (possibly even
making freezing reliable for filesystems such as ext2). The second approach
is simpler as we could do most of the freezing checks while we start a
transaction at least for filesystems that have transactions... Any
preferences?

								Honza
 
> > @@ -945,8 +949,6 @@ xfs_file_aio_write(
> >  	if (ocount == 0)
> >  		return 0;
> >  
> > -	xfs_wait_for_freeze(ip->i_mount, SB_FREEZE_WRITE);
> > -
> 
> that's where sb_start_write() needs to be, and the sb-end_write()
> call needs to below the generic_write_sync() calls that will trigger
> IO on O_SYNC writes. Otherwise it is not covering all the IO path
> correctly.
> 
> >  	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
> >  		return -EIO;
> >  
> > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > index 3579bc8..798b9c6 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -793,6 +793,7 @@ xfs_setattr_size(
> >  		return xfs_setattr_nonsize(ip, iattr, 0);
> >  	}
> >  
> > +	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
> >  	/*
> >  	 * Make sure that the dquots are attached to the inode.
> >  	 */
> > @@ -849,10 +850,14 @@ xfs_setattr_size(
> >  				     xfs_get_blocks);
> >  	if (error)
> >  		goto out_unlock;
> > +	/* Drop the write access to avoid lock inversion with ilock */
> > +	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
> >  
> >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> >  	lock_flags |= XFS_ILOCK_EXCL;
> >  
> > +	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
> > +
> 
> This is caused by the previous problems I pointed out. You should
> not need to drop the freeze reference here at all.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/8] xfs: Use generic writers counter instead of m_active_trans counter
  2012-01-24  8:05   ` Dave Chinner
@ 2012-02-04  2:13     ` Eric Sandeen
  0 siblings, 0 replies; 26+ messages in thread
From: Eric Sandeen @ 2012-02-04  2:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-fsdevel, Dave Chinner, Surbhi Palande,
	Kamal Mostafa, Christoph Hellwig, LKML, xfs, linux-ext4

On 1/24/12 2:05 AM, Dave Chinner wrote:
> On Fri, Jan 20, 2012 at 09:34:44PM +0100, Jan Kara wrote:
>> m_active_trans counter is racy wrt filesystem freezing. The patch replaces it
>> with generic counter of running transactions which is properly synchronized
>> with filesystem freezing. Things are a bit more complex because we need to log
>> a dummy transaction and free block counters after the filesystem is frozen so
>> we need to pass information to _xfs_trans_alloc() whether the transaction is
>> part of filesystem freezing or not.
>>
>> Signed-off-by: Jan Kara <jack@suse.cz>
>> ---
>>  fs/xfs/xfs_fsops.c |    5 +++--
>>  fs/xfs/xfs_fsops.h |    2 +-
>>  fs/xfs/xfs_iomap.c |    4 ++--
>>  fs/xfs/xfs_mount.c |    2 +-
>>  fs/xfs/xfs_mount.h |    2 --
>>  fs/xfs/xfs_super.c |    3 +--
>>  fs/xfs/xfs_sync.c  |   13 +++----------
>>  fs/xfs/xfs_trans.c |   19 ++++++++++++-------
>>  fs/xfs/xfs_trans.h |    3 ++-
>>  9 files changed, 25 insertions(+), 28 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
>> index 1c6fdeb..503fdfa 100644
>> --- a/fs/xfs/xfs_fsops.c
>> +++ b/fs/xfs/xfs_fsops.c
>> @@ -645,12 +645,13 @@ out:
>>   */
>>  int
>>  xfs_fs_log_dummy(
>> -	xfs_mount_t	*mp)
>> +	xfs_mount_t	*mp,
>> +	bool		for_freeze)
> 
> What does "for_freeze" mean? If it is true, does it mean we are in a
> freeze or not in a freeze? I can't really tell from the code,
> because it just passed true or false, and in one case the code
> always passes false even though the code can be called after
> SB_FREEZE_WRITE is set (xfs_quiesce_data() via sync_filesystem())

Is that a problem?  This is all about the FREEZE_TRANS context right?
So whether we are under FREEZE_WRITE (during freeze_super) or not
(during i.e. sys_sync) I think it's ok, no?  See more below...

>>  #endif	/* __XFS_FSOPS_H__ */
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 9afa282..fd47f6e 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -679,8 +679,8 @@ xfs_iomap_write_unwritten(
>>  		 * the same inode that we complete here and might deadlock
>>  		 * on the iolock.
>>  		 */
>> -		xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
>> -		tp = _xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE, KM_NOFS);
>> +		tp = _xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE,
>> +				KM_NOFS, false);
> 
> This is a documentation regression - the code was clearly self
> documenting w.r.t. frozen filesystem behaviour. It isn't anymore.
> 
> I'd suggest that we need:
> 
> #define XFS_WAIT_FOR_FREEZE	false
> #define XFS_IGNORE_FROZEN_SB	true

>From my reading those are confusing as well.  What the flag controls is:

false: do sb_start_write(SB_FREEZE_TRANS) in trans alloc (will call sb_stop_write in trans free)
true: skip sb_start_write & set flag to skip sb_stop_write(SB_FREEZE_TRANS) in trans free

so maybe:

#define XFS_HONOR_FREEZE_TRANS	false
#define XFS_IGNORE_FREEZE_TRANS	true

would be a little clearer?  Or maybe:

#define XFS_INC_FREEZE_TRANS	false
#define XFS_NOINC_FREEZE_TRANS	true

or

#define XFS_TRANS_START_WRITE		false
#define XFS_NO_TRANS_START_WRITE	true

bleah, ok, step away from the bike shed....

> as the parameters here to makeit extremely clear when reading the
> code exactly what that last parameter means. i.e. it is self
> documenting. That will help clear up a lot of the confusion on what
> these magic boolean parameters are supposed to mean....
> 
>> @@ -312,7 +311,6 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int flags, char *fname,
>>  #define SHUTDOWN_DEVICE_REQ	0x0020	/* failed all paths to the device */
>>  
>>  #define xfs_test_for_freeze(mp)		((mp)->m_super->s_frozen)
> 
> I'd remove this, too, and just open code it.
> 
>> diff --git a/fs/xfs/xfs_sync.c b/fs/xfs/xfs_sync.c
>> index aa3dc1a..24f4d7c 100644
>> --- a/fs/xfs/xfs_sync.c
>> +++ b/fs/xfs/xfs_sync.c
>> @@ -373,7 +373,7 @@ xfs_quiesce_data(
>>  
>>  	/* mark the log as covered if needed */
>>  	if (xfs_log_need_covered(mp))
>> -		error2 = xfs_fs_log_dummy(mp);
>> +		error2 = xfs_fs_log_dummy(mp, false);
> 
> This is the call that can occur inside SB_FREEZE_WRITE context as
> well as outside it.

Somehow I'm missing the problem here.  This basically means that we will
always increment the metadata writer count for the new transaction, and drop
it when done.  But I _think_ that's ok in both spots, no?  Neither is called
inside of a FREEZE_TRANS.

-Eric


>>  
>>  	/* flush data-only devices */
>>  	if (mp->m_rtdev_targp)
>> @@ -421,18 +421,11 @@ xfs_quiesce_attr(
>>  	int	error = 0;
>>  
>>  	/* wait for all modifications to complete */
>> -	while (atomic_read(&mp->m_active_trans) > 0)
>> -		delay(100);
>> +	sb_wait_write(mp->m_super, SB_FREEZE_TRANS);
>>  
>>  	/* flush inodes and push all remaining buffers out to disk */
>>  	xfs_quiesce_fs(mp);
>>  
>> -	/*
>> -	 * Just warn here till VFS can correctly support
>> -	 * read-only remount without racing.
>> -	 */
>> -	WARN_ON(atomic_read(&mp->m_active_trans) != 0);
>> -
> 
> Now there's an interesting question. Does this break read-only
> remount?
> 
> /me checks the sb_wait_write() code
> 
> No, it looks like it should be fine.
> 
>>  	/* Push the superblock and write an unmount record */
>>  	error = xfs_log_sbcount(mp);
>>  	if (error)
>> @@ -467,7 +460,7 @@ xfs_sync_worker(
>>  		/* dgc: errors ignored here */
>>  		if (mp->m_super->s_frozen == SB_UNFROZEN &&
>>  		    xfs_log_need_covered(mp))
>> -			error = xfs_fs_log_dummy(mp);
>> +			error = xfs_fs_log_dummy(mp, false);
>>  		else
>>  			xfs_log_force(mp, 0);
>>  		error = xfs_qm_sync(mp, SYNC_TRYLOCK);
>> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
>> index 1f35b2f..e97014b 100644
>> --- a/fs/xfs/xfs_trans.c
>> +++ b/fs/xfs/xfs_trans.c
>> @@ -577,24 +577,28 @@ xfs_trans_alloc(
>>  	xfs_mount_t	*mp,
>>  	uint		type)
>>  {
>> -	xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
>> -	return _xfs_trans_alloc(mp, type, KM_SLEEP);
>> +	return _xfs_trans_alloc(mp, type, KM_SLEEP, false);
>>  }
>>  
>>  xfs_trans_t *
>>  _xfs_trans_alloc(
>>  	xfs_mount_t	*mp,
>>  	uint		type,
>> -	uint		memflags)
>> +	uint		memflags,
>> +	bool		freezing)
>>  {
>>  	xfs_trans_t	*tp;
>>  
>> -	atomic_inc(&mp->m_active_trans);
>> -
>> +	if (!freezing)
>> +		sb_start_write(mp->m_super, SB_FREEZE_TRANS);
>> +	else
>> +		WARN_ON(xfs_test_for_freeze(mp) != SB_FREEZE_TRANS);
> 
> Just open code xfs_test_for_freeze() and add a line of whitespace
> after this.
> 
>>  	tp = kmem_zone_zalloc(xfs_trans_zone, memflags);
>>  	tp->t_magic = XFS_TRANS_MAGIC;
>>  	tp->t_type = type;
>>  	tp->t_mountp = mp;
>> +	if (freezing)
>> +		tp->t_flags |= XFS_TRANS_FREEZING;
> 
> Simply assign the value - tp->t_flags is guaranteed to be 0 right
> now.
> 
>>  	INIT_LIST_HEAD(&tp->t_items);
>>  	INIT_LIST_HEAD(&tp->t_busy);
>>  	return tp;
>> @@ -611,7 +615,8 @@ xfs_trans_free(
>>  	xfs_alloc_busy_sort(&tp->t_busy);
>>  	xfs_alloc_busy_clear(tp->t_mountp, &tp->t_busy, false);
>>  
>> -	atomic_dec(&tp->t_mountp->m_active_trans);
>> +	if (!(tp->t_flags & XFS_TRANS_FREEZING))
>> +		sb_end_write(tp->t_mountp->m_super, SB_FREEZE_TRANS);
>>  	xfs_trans_free_dqinfo(tp);
>>  	kmem_zone_free(xfs_trans_zone, tp);
>>  }
>> @@ -654,7 +659,7 @@ xfs_trans_dup(
>>  
>>  	xfs_trans_dup_dqinfo(tp, ntp);
>>  
>> -	atomic_inc(&tp->t_mountp->m_active_trans);
>> +	sb_dup_write(tp->t_mountp->m_super, SB_FREEZE_TRANS);
> 
> That's strangly named. Isn't the normal thing to do here use a "__"
> prefix for operations that just need an extra reference because they
> already have one (i.e. __sb_start_write())?
> 
> This also looks broken with repsect to the new XFS_TRANS_FREEZING
> flag. If it is set on the parent, it needs to be set on the
> duplicated transaction. And if it is set, then no extra reference
> should be taken.
> 
> Cheers,
> 
> Dave.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/8] xfs: Use generic writers counter instead of m_active_trans counter
  2012-01-20 20:34 ` [PATCH 6/8] xfs: Use generic writers counter instead of m_active_trans counter Jan Kara
  2012-01-24  8:05   ` Dave Chinner
@ 2012-02-04  2:42   ` Eric Sandeen
  2012-02-04  4:34   ` Eric Sandeen
  2 siblings, 0 replies; 26+ messages in thread
From: Eric Sandeen @ 2012-02-04  2:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Dave Chinner, Surbhi Palande, Kamal Mostafa,
	Christoph Hellwig, LKML, xfs, linux-ext4

On 1/20/12 2:34 PM, Jan Kara wrote:
> m_active_trans counter is racy wrt filesystem freezing. The patch replaces it
> with generic counter of running transactions which is properly synchronized
> with filesystem freezing. Things are a bit more complex because we need to log
> a dummy transaction and free block counters after the filesystem is frozen so
> we need to pass information to _xfs_trans_alloc() whether the transaction is
> part of filesystem freezing or not.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---

...

>  xfs_trans_t *
>  _xfs_trans_alloc(
>  	xfs_mount_t	*mp,
>  	uint		type,
> -	uint		memflags)
> +	uint		memflags,
> +	bool		freezing)
>  {
>  	xfs_trans_t	*tp;
>  
> -	atomic_inc(&mp->m_active_trans);
> -
> +	if (!freezing)
> +		sb_start_write(mp->m_super, SB_FREEZE_TRANS);
> +	else
> +		WARN_ON(xfs_test_for_freeze(mp) != SB_FREEZE_TRANS);

Hm this could be an issue because for both the umount path and the 
freeze / xfs_quiesce_attr path, we call xfs_log_sbcount which sends
"true" for freezing and we'll trip up here because we won't be
in SB_FREEZE_TRANS during umount.

I think we have to push the flag all the way up to xfs_log_sbcount
callers?

-Eric

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/8] fs: Improve filesystem freezing handling
  2012-01-20 20:34 ` [PATCH 1/8] fs: Improve filesystem freezing handling Jan Kara
@ 2012-02-04  3:03   ` Eric Sandeen
  2012-02-06 15:17     ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: Eric Sandeen @ 2012-02-04  3:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Dave Chinner, Surbhi Palande, Kamal Mostafa,
	Christoph Hellwig, LKML, xfs, linux-ext4

On 1/20/12 2:34 PM, Jan Kara wrote:
> vfs_check_frozen() tests are racy since the filesystem can be frozen just after
> the test is performed. Thus in write paths we can end up marking some pages or
> inodes dirty even though filesystem is already frozen. This creates problems
> with flusher thread hanging on frozen filesystem.
> 
> Another problem is that exclusion between ->page_mkwrite() and filesystem
> freezing has been handled by setting page dirty and then verifying s_frozen.
> This guaranteed that either the freezing code sees the faulted page, writes it,
> and writeprotects it again or we see s_frozen set and bail out of page fault.
> This works to protect from page being marked writeable while filesystem
> freezing is running but has an unpleasant artefact of leaving dirty (although
> unmodified and writeprotected) pages on frozen filesystem resulting in similar
> problems with flusher thread as the first problem.
> 
> This patch aims at providing exclusion between write paths and filesystem
> freezing. We implement a writer-freeze read-write semaphores in the superblock
> for each freezing level (currently there are two - SB_FREEZE_WRITE for data and
> SB_FREEZE_TRANS for metadata). Write paths which should block freezing on given
> level (e.g. ->block_page_mkwrite(), ->aio_write() for SB_FREEZE_WRITE level;
> transaction lifetime for SB_FREEZE_TRANS level) hold reader side of the
> semaphore. Code freezing the filesystem to a given level takes the writer side.
> 
> Only that we don't really want to bounce cachelines of the semaphore between
> CPUs for each write happening. So we implement the reader side of the semaphore
> as a per-cpu counter and the writer side is implemented using s_frozen
> superblock field.
> 
> Acked-by: "Theodore Ts'o" <tytso@mit.edu>
> Signed-off-by: Jan Kara <jack@suse.cz>

...

> @@ -135,6 +157,11 @@ static struct super_block *alloc_super(struct file_system_type *type)
>  #else
>  		INIT_LIST_HEAD(&s->s_files);
>  #endif
> +		if (init_sb_writers(s, SB_FREEZE_WRITE, "sb_writers_write"))
> +			goto err_out;
> +		if (init_sb_writers(s, SB_FREEZE_TRANS, "sb_writers_trans"))
> +			goto err_out;
> +
>  		s->s_bdi = &default_backing_dev_info;
>  		INIT_LIST_HEAD(&s->s_instances);
>  		INIT_HLIST_BL_HEAD(&s->s_anon);
> @@ -186,6 +213,17 @@ static struct super_block *alloc_super(struct file_system_type *type)
>  	}
>  out:
>  	return s;
> +err_out:
> +	security_sb_free(s);
> +#ifdef CONFIG_SMP
> +	if (s->s_files)
> +		free_percpu(s->s_files);
> +#endif
> +	destroy_sb_writers(s, SB_FREEZE_WRITE);
> +	destroy_sb_writers(s, SB_FREEZE_TRANS);

You probably ran into this already but the writer percpu vars need
to be torn down in destroy_super() as well.

-Eric



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/8] xfs: Protect xfs_file_aio_write() & xfs_setattr_size() with sb_start_write - sb_end_write
  2012-01-20 20:34 ` [PATCH 5/8] xfs: Protect xfs_file_aio_write() & xfs_setattr_size() with sb_start_write - sb_end_write Jan Kara
  2012-01-24  7:19   ` Dave Chinner
@ 2012-02-04  4:30   ` Eric Sandeen
  2012-02-04  4:50     ` Eric Sandeen
  2012-02-05 23:11     ` Dave Chinner
  1 sibling, 2 replies; 26+ messages in thread
From: Eric Sandeen @ 2012-02-04  4:30 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Dave Chinner, Surbhi Palande, Kamal Mostafa,
	Christoph Hellwig, LKML, xfs, linux-ext4, Ben Myers, Alex Elder

On 1/20/12 2:34 PM, Jan Kara wrote:
> Replace racy xfs_wait_for_freeze() check in xfs_file_aio_write() with
> a reliable sb_start_write() - sb_end_write() locking.

Here's what I'm running with now.  With this and my modified
patch6, I can pass xfstests 068 on xfs.

-Eric



diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7e5bc87..f1cacdc 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -874,11 +874,11 @@ xfs_file_aio_write(
 	if (ocount == 0)
 		return 0;
 
-	xfs_wait_for_freeze(ip->i_mount, SB_FREEZE_WRITE);
-
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
 		return -EIO;
 
+	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
+
 	if (unlikely(file->f_flags & O_DIRECT))
 		ret = xfs_file_dio_aio_write(iocb, iovp, nr_segs, pos, ocount);
 	else
@@ -895,6 +895,7 @@ xfs_file_aio_write(
 		if (err < 0)
 			ret = err;
 	}
+	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
 
 	return ret;
 }
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ab30253..7f3fa17 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -773,6 +773,7 @@ xfs_setattr_size(
 			ATTR_MTIME_SET|ATTR_KILL_SUID|ATTR_KILL_SGID|
 			ATTR_KILL_PRIV|ATTR_TIMES_SET)) == 0);
 
+	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
 	lock_flags = XFS_ILOCK_EXCL;
 	if (!(flags & XFS_ATTR_NOLOCK))
 		lock_flags |= XFS_IOLOCK_EXCL;
@@ -792,6 +793,7 @@ xfs_setattr_size(
 		 * Use the regular setattr path to update the timestamps.
 		 */
 		xfs_iunlock(ip, lock_flags);
+		sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
 		iattr->ia_valid &= ~ATTR_SIZE;
 		return xfs_setattr_nonsize(ip, iattr, 0);
 	}
@@ -938,6 +940,7 @@ xfs_setattr_size(
 out_unlock:
 	if (lock_flags)
 		xfs_iunlock(ip, lock_flags);
+	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
 	return error;
 
 out_trans_abort:


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/8] xfs: Use generic writers counter instead of m_active_trans counter
  2012-01-20 20:34 ` [PATCH 6/8] xfs: Use generic writers counter instead of m_active_trans counter Jan Kara
  2012-01-24  8:05   ` Dave Chinner
  2012-02-04  2:42   ` Eric Sandeen
@ 2012-02-04  4:34   ` Eric Sandeen
  2 siblings, 0 replies; 26+ messages in thread
From: Eric Sandeen @ 2012-02-04  4:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Dave Chinner, Surbhi Palande, Kamal Mostafa,
	Christoph Hellwig, LKML, xfs, linux-ext4

On 1/20/12 2:34 PM, Jan Kara wrote:
> m_active_trans counter is racy wrt filesystem freezing. The patch replaces it
> with generic counter of running transactions which is properly synchronized
> with filesystem freezing. Things are a bit more complex because we need to log
> a dummy transaction and free block counters after the filesystem is frozen so
> we need to pass information to _xfs_trans_alloc() whether the transaction is
> part of filesystem freezing or not.

Here's the version I have which will get me past xfstest 068 and avoids
some warnings.

This version brings the sb_start_write go/nogo flag up to
xfs_log_sbcount() to avoid a warning at unmount time.

I'd still like to make some of the variable/macro naming more obvious
but this might do for now.

diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 1c6fdeb..503fdfa 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -645,12 +645,13 @@ out:
  */
 int
 xfs_fs_log_dummy(
-	xfs_mount_t	*mp)
+	xfs_mount_t	*mp,
+	bool		for_freeze)
 {
 	xfs_trans_t	*tp;
 	int		error;
 
-	tp = _xfs_trans_alloc(mp, XFS_TRANS_DUMMY1, KM_SLEEP);
+	tp = _xfs_trans_alloc(mp, XFS_TRANS_DUMMY1, KM_SLEEP, for_freeze);
 	error = xfs_trans_reserve(tp, 0, mp->m_sb.sb_sectsize + 128, 0, 0,
 					XFS_DEFAULT_LOG_COUNT);
 	if (error) {
diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h
index 1b6a98b..4d6253e 100644
--- a/fs/xfs/xfs_fsops.h
+++ b/fs/xfs/xfs_fsops.h
@@ -25,6 +25,6 @@ extern int xfs_fs_counts(xfs_mount_t *mp, xfs_fsop_counts_t *cnt);
 extern int xfs_reserve_blocks(xfs_mount_t *mp, __uint64_t *inval,
 				xfs_fsop_resblks_t *outval);
 extern int xfs_fs_goingdown(xfs_mount_t *mp, __uint32_t inflags);
-extern int xfs_fs_log_dummy(struct xfs_mount *mp);
+extern int xfs_fs_log_dummy(struct xfs_mount *mp, bool for_freeze);
 
 #endif	/* __XFS_FSOPS_H__ */
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 246c7d5..99f4e42 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -679,8 +679,8 @@ xfs_iomap_write_unwritten(
 		 * the same inode that we complete here and might deadlock
 		 * on the iolock.
 		 */
-		xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
-		tp = _xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE, KM_NOFS);
+		tp = _xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE,
+				KM_NOFS, XFS_HONOR_FREEZE_TRANS);
 		tp->t_flags |= XFS_TRANS_RESERVE;
 		error = xfs_trans_reserve(tp, resblks,
 				XFS_WRITE_LOG_RES(mp), 0,
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index 828662f..b2731ea 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -308,4 +308,7 @@ static inline __uint64_t howmany_64(__uint64_t x, __uint32_t y)
 
 #endif /* DEBUG */
 
+#define XFS_HONOR_FREEZE_TRANS	false
+#define XFS_IGNORE_FREEZE_TRANS	true
+
 #endif /* __XFS_LINUX__ */
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index d06afbc..663bc96 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1513,7 +1513,7 @@ xfs_unmountfs(
 		xfs_warn(mp, "Unable to free reserved block pool. "
 				"Freespace may not be correct on next mount.");
 
-	error = xfs_log_sbcount(mp);
+	error = xfs_log_sbcount(mp, XFS_HONOR_FREEZE_TRANS);
 	if (error)
 		xfs_warn(mp, "Unable to update superblock counters. "
 				"Freespace may not be correct on next mount.");
@@ -1555,7 +1555,7 @@ xfs_fs_writable(xfs_mount_t *mp)
  * block when the transaction subsystem is in its frozen state.
  */
 int
-xfs_log_sbcount(xfs_mount_t *mp)
+xfs_log_sbcount(xfs_mount_t *mp, int freezing)
 {
 	xfs_trans_t	*tp;
 	int		error;
@@ -1572,7 +1572,7 @@ xfs_log_sbcount(xfs_mount_t *mp)
 	if (!xfs_sb_version_haslazysbcount(&mp->m_sb))
 		return 0;
 
-	tp = _xfs_trans_alloc(mp, XFS_TRANS_SB_COUNT, KM_SLEEP);
+	tp = _xfs_trans_alloc(mp, XFS_TRANS_SB_COUNT, KM_SLEEP, freezing);
 	error = xfs_trans_reserve(tp, 0, mp->m_sb.sb_sectsize + 128, 0, 0,
 					XFS_DEFAULT_LOG_COUNT);
 	if (error) {
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 19f69e2..9319047 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -195,7 +195,6 @@ typedef struct xfs_mount {
 	uint			m_chsize;	/* size of next field */
 	struct xfs_chash	*m_chash;	/* fs private inode per-cluster
 						 * hash table */
-	atomic_t		m_active_trans;	/* number trans frozen */
 #ifdef HAVE_PERCPU_SB
 	xfs_icsb_cnts_t __percpu *m_sb_cnts;	/* per-cpu superblock counters */
 	unsigned long		m_icsb_counters; /* disabled per-cpu counters */
@@ -311,7 +310,6 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int flags, char *fname,
 #define SHUTDOWN_DEVICE_REQ	0x0020	/* failed all paths to the device */
 
 #define xfs_test_for_freeze(mp)		((mp)->m_super->s_frozen)
-#define xfs_wait_for_freeze(mp,l)	vfs_check_frozen((mp)->m_super, (l))
 
 /*
  * Flags for xfs_mountfs
@@ -370,7 +368,7 @@ typedef struct xfs_mod_sb {
 	int64_t		msb_delta;	/* Change to make to specified field */
 } xfs_mod_sb_t;
 
-extern int	xfs_log_sbcount(xfs_mount_t *);
+extern int	xfs_log_sbcount(xfs_mount_t *, int);
 extern __uint64_t xfs_default_resblks(xfs_mount_t *mp);
 extern int	xfs_mountfs(xfs_mount_t *mp);
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index ee5b695..a257077 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1198,7 +1198,7 @@ xfs_fs_freeze(
 
 	xfs_save_resvblks(mp);
 	xfs_quiesce_attr(mp);
-	return -xfs_fs_log_dummy(mp);
+	return -xfs_fs_log_dummy(mp, XFS_IGNORE_FREEZE_TRANS);
 }
 
 STATIC int
@@ -1285,7 +1285,6 @@ xfs_fs_fill_super(
 
 	spin_lock_init(&mp->m_sb_lock);
 	mutex_init(&mp->m_growlock);
-	atomic_set(&mp->m_active_trans, 0);
 
 	mp->m_super = sb;
 	sb->s_fs_info = mp;
diff --git a/fs/xfs/xfs_sync.c b/fs/xfs/xfs_sync.c
index 40b75ee..b9a6be0 100644
--- a/fs/xfs/xfs_sync.c
+++ b/fs/xfs/xfs_sync.c
@@ -406,7 +406,7 @@ xfs_quiesce_data(
 
 	/* mark the log as covered if needed */
 	if (xfs_log_need_covered(mp))
-		error2 = xfs_fs_log_dummy(mp);
+		error2 = xfs_fs_log_dummy(mp, XFS_HONOR_FREEZE_TRANS);
 
 	/* flush data-only devices */
 	if (mp->m_rtdev_targp)
@@ -454,20 +454,13 @@ xfs_quiesce_attr(
 	int	error = 0;
 
 	/* wait for all modifications to complete */
-	while (atomic_read(&mp->m_active_trans) > 0)
-		delay(100);
+	sb_wait_write(mp->m_super, SB_FREEZE_TRANS);
 
 	/* flush inodes and push all remaining buffers out to disk */
 	xfs_quiesce_fs(mp);
 
-	/*
-	 * Just warn here till VFS can correctly support
-	 * read-only remount without racing.
-	 */
-	WARN_ON(atomic_read(&mp->m_active_trans) != 0);
-
 	/* Push the superblock and write an unmount record */
-	error = xfs_log_sbcount(mp);
+	error = xfs_log_sbcount(mp, XFS_IGNORE_FREEZE_TRANS);
 	if (error)
 		xfs_warn(mp, "xfs_attr_quiesce: failed to log sb changes. "
 				"Frozen image may not be consistent.");
@@ -500,7 +493,7 @@ xfs_sync_worker(
 		/* dgc: errors ignored here */
 		if (mp->m_super->s_frozen == SB_UNFROZEN &&
 		    xfs_log_need_covered(mp))
-			error = xfs_fs_log_dummy(mp);
+			error = xfs_fs_log_dummy(mp, XFS_HONOR_FREEZE_TRANS);
 		else
 			xfs_log_force(mp, 0);
 
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 329b06a..dda8877 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -577,24 +577,28 @@ xfs_trans_alloc(
 	xfs_mount_t	*mp,
 	uint		type)
 {
-	xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
-	return _xfs_trans_alloc(mp, type, KM_SLEEP);
+	return _xfs_trans_alloc(mp, type, KM_SLEEP, XFS_HONOR_FREEZE_TRANS);
 }
 
 xfs_trans_t *
 _xfs_trans_alloc(
 	xfs_mount_t	*mp,
 	uint		type,
-	uint		memflags)
+	uint		memflags,
+	bool		freezing)
 {
 	xfs_trans_t	*tp;
 
-	atomic_inc(&mp->m_active_trans);
-
+	if (!freezing)
+		sb_start_write(mp->m_super, SB_FREEZE_TRANS);
+	else
+		WARN_ON(xfs_test_for_freeze(mp) != SB_FREEZE_TRANS);
 	tp = kmem_zone_zalloc(xfs_trans_zone, memflags);
 	tp->t_magic = XFS_TRANS_MAGIC;
 	tp->t_type = type;
 	tp->t_mountp = mp;
+	if (freezing)
+		tp->t_flags |= XFS_TRANS_FREEZING;
 	INIT_LIST_HEAD(&tp->t_items);
 	INIT_LIST_HEAD(&tp->t_busy);
 	return tp;
@@ -611,7 +615,8 @@ xfs_trans_free(
 	xfs_alloc_busy_sort(&tp->t_busy);
 	xfs_alloc_busy_clear(tp->t_mountp, &tp->t_busy, false);
 
-	atomic_dec(&tp->t_mountp->m_active_trans);
+	if (!(tp->t_flags & XFS_TRANS_FREEZING))
+		sb_end_write(tp->t_mountp->m_super, SB_FREEZE_TRANS);
 	xfs_trans_free_dqinfo(tp);
 	kmem_zone_free(xfs_trans_zone, tp);
 }
@@ -654,7 +659,10 @@ xfs_trans_dup(
 
 	xfs_trans_dup_dqinfo(tp, ntp);
 
-	atomic_inc(&tp->t_mountp->m_active_trans);
+	if (tp->t_flags & XFS_TRANS_FREEZING)
+		ntp->t_flags |= XFS_TRANS_FREEZING;
+	else
+		sb_dup_write(tp->t_mountp->m_super, SB_FREEZE_TRANS);
 	return ntp;
 }
 
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index f611870..cae91db 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -179,6 +179,7 @@ struct xfs_log_item_desc {
 #define	XFS_TRANS_SYNC		0x08	/* make commit synchronous */
 #define XFS_TRANS_DQ_DIRTY	0x10	/* at least one dquot in trx dirty */
 #define XFS_TRANS_RESERVE	0x20    /* OK to use reserved data blocks */
+#define XFS_TRANS_FREEZING	0x40	/* can happen on frozen filesystem */
 
 /*
  * Values for call flags parameter.
@@ -447,7 +448,7 @@ typedef struct xfs_trans {
  * XFS transaction mechanism exported interfaces.
  */
 xfs_trans_t	*xfs_trans_alloc(struct xfs_mount *, uint);
-xfs_trans_t	*_xfs_trans_alloc(struct xfs_mount *, uint, uint);
+xfs_trans_t	*_xfs_trans_alloc(struct xfs_mount *, uint, uint, bool);
 xfs_trans_t	*xfs_trans_dup(xfs_trans_t *);
 int		xfs_trans_reserve(xfs_trans_t *, uint, uint, uint,
 				  uint, uint);




^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/8] xfs: Protect xfs_file_aio_write() & xfs_setattr_size() with sb_start_write - sb_end_write
  2012-02-04  4:30   ` Eric Sandeen
@ 2012-02-04  4:50     ` Eric Sandeen
  2012-02-05 23:11     ` Dave Chinner
  1 sibling, 0 replies; 26+ messages in thread
From: Eric Sandeen @ 2012-02-04  4:50 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jan Kara, linux-fsdevel, Dave Chinner, Surbhi Palande,
	Kamal Mostafa, Christoph Hellwig, LKML, xfs, linux-ext4,
	Ben Myers, Alex Elder

On 2/3/12 10:30 PM, Eric Sandeen wrote:
> On 1/20/12 2:34 PM, Jan Kara wrote:
>> Replace racy xfs_wait_for_freeze() check in xfs_file_aio_write() with
>> a reliable sb_start_write() - sb_end_write() locking.
> 
> Here's what I'm running with now.  With this and my modified
> patch6, I can pass xfstests 068 on xfs.

(and I dropped patch 4, with the lock rearrangement, FWIW - so
the only xfs patches I'm running are the 2 I resent).

-Eric

> -Eric
> 
> 
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 7e5bc87..f1cacdc 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -874,11 +874,11 @@ xfs_file_aio_write(
>  	if (ocount == 0)
>  		return 0;
>  
> -	xfs_wait_for_freeze(ip->i_mount, SB_FREEZE_WRITE);
> -
>  	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
>  		return -EIO;
>  
> +	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
> +
>  	if (unlikely(file->f_flags & O_DIRECT))
>  		ret = xfs_file_dio_aio_write(iocb, iovp, nr_segs, pos, ocount);
>  	else
> @@ -895,6 +895,7 @@ xfs_file_aio_write(
>  		if (err < 0)
>  			ret = err;
>  	}
> +	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
>  
>  	return ret;
>  }
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index ab30253..7f3fa17 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -773,6 +773,7 @@ xfs_setattr_size(
>  			ATTR_MTIME_SET|ATTR_KILL_SUID|ATTR_KILL_SGID|
>  			ATTR_KILL_PRIV|ATTR_TIMES_SET)) == 0);
>  
> +	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
>  	lock_flags = XFS_ILOCK_EXCL;
>  	if (!(flags & XFS_ATTR_NOLOCK))
>  		lock_flags |= XFS_IOLOCK_EXCL;
> @@ -792,6 +793,7 @@ xfs_setattr_size(
>  		 * Use the regular setattr path to update the timestamps.
>  		 */
>  		xfs_iunlock(ip, lock_flags);
> +		sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
>  		iattr->ia_valid &= ~ATTR_SIZE;
>  		return xfs_setattr_nonsize(ip, iattr, 0);
>  	}
> @@ -938,6 +940,7 @@ xfs_setattr_size(
>  out_unlock:
>  	if (lock_flags)
>  		xfs_iunlock(ip, lock_flags);
> +	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
>  	return error;
>  
>  out_trans_abort:
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/8] vfs: Protect write paths by sb_start_write - sb_end_write
  2012-01-20 20:34 ` [PATCH 2/8] vfs: Protect write paths by sb_start_write - sb_end_write Jan Kara
  2012-01-24  8:21   ` Dave Chinner
@ 2012-02-05  6:13   ` Eric Sandeen
  2012-02-06 15:33     ` Jan Kara
  1 sibling, 1 reply; 26+ messages in thread
From: Eric Sandeen @ 2012-02-05  6:13 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Dave Chinner, Surbhi Palande, Kamal Mostafa,
	Christoph Hellwig, LKML, xfs, linux-ext4

On 1/20/12 2:34 PM, Jan Kara wrote:
> There are three entry points which dirty pages in a filesystem.  mmap (handled
> by block_page_mkwrite()), buffered write (handled by
> __generic_file_aio_write()), and truncate (it can dirty last partial page -
> handled inside each filesystem separately). Protect these places with
> sb_start_write() and sb_end_write().

The protection for truncate got lost since the first patchset, was that
on purpose?

-Eric

> Acked-by: "Theodore Ts'o" <tytso@mit.edu>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/buffer.c  |   22 ++++------------------
>  mm/filemap.c |    3 ++-
>  2 files changed, 6 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 19d8eb7..550714d 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2338,8 +2338,8 @@ EXPORT_SYMBOL(block_commit_write);
>   * beyond EOF, then the page is guaranteed safe against truncation until we
>   * unlock the page.
>   *
> - * Direct callers of this function should call vfs_check_frozen() so that page
> - * fault does not busyloop until the fs is thawed.
> + * Direct callers of this function should protect against filesystem freezing
> + * using sb_start_write() - sb_end_write() functions.
>   */
>  int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
>  			 get_block_t get_block)
> @@ -2371,18 +2371,7 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
>  
>  	if (unlikely(ret < 0))
>  		goto out_unlock;
> -	/*
> -	 * Freezing in progress? We check after the page is marked dirty and
> -	 * with page lock held so if the test here fails, we are sure freezing
> -	 * code will wait during syncing until the page fault is done - at that
> -	 * point page will be dirty and unlocked so freezing code will write it
> -	 * and writeprotect it again.
> -	 */
>  	set_page_dirty(page);
> -	if (inode->i_sb->s_frozen != SB_UNFROZEN) {
> -		ret = -EAGAIN;
> -		goto out_unlock;
> -	}
>  	wait_on_page_writeback(page);
>  	return 0;
>  out_unlock:
> @@ -2397,12 +2386,9 @@ int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
>  	int ret;
>  	struct super_block *sb = vma->vm_file->f_path.dentry->d_inode->i_sb;
>  
> -	/*
> -	 * This check is racy but catches the common case. The check in
> -	 * __block_page_mkwrite() is reliable.
> -	 */
> -	vfs_check_frozen(sb, SB_FREEZE_WRITE);
> +	sb_start_write(sb, SB_FREEZE_WRITE);
>  	ret = __block_page_mkwrite(vma, vmf, get_block);
> +	sb_end_write(sb, SB_FREEZE_WRITE);
>  	return block_page_mkwrite_return(ret);
>  }
>  EXPORT_SYMBOL(block_page_mkwrite);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c0018f2..471b9ae 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2529,7 +2529,7 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
>  	count = ocount;
>  	pos = *ppos;
>  
> -	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
> +	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
>  
>  	/* We can write back this queue in page reclaim */
>  	current->backing_dev_info = mapping->backing_dev_info;
> @@ -2601,6 +2601,7 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
>  				pos, ppos, count, written);
>  	}
>  out:
> +	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
>  	current->backing_dev_info = NULL;
>  	return written ? written : err;
>  }


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/8] xfs: Protect xfs_file_aio_write() & xfs_setattr_size() with sb_start_write - sb_end_write
  2012-02-04  4:30   ` Eric Sandeen
  2012-02-04  4:50     ` Eric Sandeen
@ 2012-02-05 23:11     ` Dave Chinner
  1 sibling, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2012-02-05 23:11 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jan Kara, Alex Elder, Surbhi Palande, Kamal Mostafa, LKML, xfs,
	Christoph Hellwig, Ben Myers, Dave Chinner, linux-fsdevel,
	linux-ext4

On Fri, Feb 03, 2012 at 10:30:05PM -0600, Eric Sandeen wrote:
> On 1/20/12 2:34 PM, Jan Kara wrote:
> > Replace racy xfs_wait_for_freeze() check in xfs_file_aio_write() with
> > a reliable sb_start_write() - sb_end_write() locking.
> 
> Here's what I'm running with now.  With this and my modified
> patch6, I can pass xfstests 068 on xfs.

Just a quick question this raises - is .splice_write() protected?

Cheers,

Dave.
> 
> -Eric
> 
> 
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 7e5bc87..f1cacdc 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -874,11 +874,11 @@ xfs_file_aio_write(
>  	if (ocount == 0)
>  		return 0;
>  
> -	xfs_wait_for_freeze(ip->i_mount, SB_FREEZE_WRITE);
> -
>  	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
>  		return -EIO;
>  
> +	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
> +

This check really should go before the shutdown check - if the
filesystem shuts down while we are freezing or attempting to freeze,
we want to abort the write after we are woken....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/8] fs: Improve filesystem freezing handling
  2012-02-04  3:03   ` Eric Sandeen
@ 2012-02-06 15:17     ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2012-02-06 15:17 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jan Kara, linux-fsdevel, Dave Chinner, Surbhi Palande,
	Kamal Mostafa, Christoph Hellwig, LKML, xfs, linux-ext4

On Fri 03-02-12 21:03:20, Eric Sandeen wrote:
> On 1/20/12 2:34 PM, Jan Kara wrote:
> > vfs_check_frozen() tests are racy since the filesystem can be frozen just after
> > the test is performed. Thus in write paths we can end up marking some pages or
> > inodes dirty even though filesystem is already frozen. This creates problems
> > with flusher thread hanging on frozen filesystem.
> > 
> > Another problem is that exclusion between ->page_mkwrite() and filesystem
> > freezing has been handled by setting page dirty and then verifying s_frozen.
> > This guaranteed that either the freezing code sees the faulted page, writes it,
> > and writeprotects it again or we see s_frozen set and bail out of page fault.
> > This works to protect from page being marked writeable while filesystem
> > freezing is running but has an unpleasant artefact of leaving dirty (although
> > unmodified and writeprotected) pages on frozen filesystem resulting in similar
> > problems with flusher thread as the first problem.
> > 
> > This patch aims at providing exclusion between write paths and filesystem
> > freezing. We implement a writer-freeze read-write semaphores in the superblock
> > for each freezing level (currently there are two - SB_FREEZE_WRITE for data and
> > SB_FREEZE_TRANS for metadata). Write paths which should block freezing on given
> > level (e.g. ->block_page_mkwrite(), ->aio_write() for SB_FREEZE_WRITE level;
> > transaction lifetime for SB_FREEZE_TRANS level) hold reader side of the
> > semaphore. Code freezing the filesystem to a given level takes the writer side.
> > 
> > Only that we don't really want to bounce cachelines of the semaphore between
> > CPUs for each write happening. So we implement the reader side of the semaphore
> > as a per-cpu counter and the writer side is implemented using s_frozen
> > superblock field.
> > 
> > Acked-by: "Theodore Ts'o" <tytso@mit.edu>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> ...
> 
> > @@ -135,6 +157,11 @@ static struct super_block *alloc_super(struct file_system_type *type)
> >  #else
> >  		INIT_LIST_HEAD(&s->s_files);
> >  #endif
> > +		if (init_sb_writers(s, SB_FREEZE_WRITE, "sb_writers_write"))
> > +			goto err_out;
> > +		if (init_sb_writers(s, SB_FREEZE_TRANS, "sb_writers_trans"))
> > +			goto err_out;
> > +
> >  		s->s_bdi = &default_backing_dev_info;
> >  		INIT_LIST_HEAD(&s->s_instances);
> >  		INIT_HLIST_BL_HEAD(&s->s_anon);
> > @@ -186,6 +213,17 @@ static struct super_block *alloc_super(struct file_system_type *type)
> >  	}
> >  out:
> >  	return s;
> > +err_out:
> > +	security_sb_free(s);
> > +#ifdef CONFIG_SMP
> > +	if (s->s_files)
> > +		free_percpu(s->s_files);
> > +#endif
> > +	destroy_sb_writers(s, SB_FREEZE_WRITE);
> > +	destroy_sb_writers(s, SB_FREEZE_TRANS);
> 
> You probably ran into this already but the writer percpu vars need
> to be torn down in destroy_super() as well.
  Actually not. Thanks for spotting this.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/8] vfs: Protect write paths by sb_start_write - sb_end_write
  2012-02-05  6:13   ` Eric Sandeen
@ 2012-02-06 15:33     ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2012-02-06 15:33 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jan Kara, linux-fsdevel, Dave Chinner, Surbhi Palande,
	Kamal Mostafa, Christoph Hellwig, LKML, xfs, linux-ext4

On Sun 05-02-12 00:13:38, Eric Sandeen wrote:
> On 1/20/12 2:34 PM, Jan Kara wrote:
> > There are three entry points which dirty pages in a filesystem.  mmap (handled
> > by block_page_mkwrite()), buffered write (handled by
> > __generic_file_aio_write()), and truncate (it can dirty last partial page -
> > handled inside each filesystem separately). Protect these places with
> > sb_start_write() and sb_end_write().
> 
> The protection for truncate got lost since the first patchset, was that
> on purpose?
  It was not lost but it got moved down into the filesystem. I forgot to
update the changelog. But after Dave's comments I think it can go back into
VFS. Just lockdep complained about deadlocks in my first naive approach -
that's why I started doing weird things with XFS locks after all.

  Anyway now I'm wiser regarding XFS locking and I also have better idea
how to achive proper lock ordering in VFS. Just we are finishing SLE11 SP2
so I didn't get to writing the patches last week... But I should get to it
maybe even today and if not then at least during this week ;)

									Honza

> > Acked-by: "Theodore Ts'o" <tytso@mit.edu>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/buffer.c  |   22 ++++------------------
> >  mm/filemap.c |    3 ++-
> >  2 files changed, 6 insertions(+), 19 deletions(-)
> > 
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index 19d8eb7..550714d 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -2338,8 +2338,8 @@ EXPORT_SYMBOL(block_commit_write);
> >   * beyond EOF, then the page is guaranteed safe against truncation until we
> >   * unlock the page.
> >   *
> > - * Direct callers of this function should call vfs_check_frozen() so that page
> > - * fault does not busyloop until the fs is thawed.
> > + * Direct callers of this function should protect against filesystem freezing
> > + * using sb_start_write() - sb_end_write() functions.
> >   */
> >  int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  			 get_block_t get_block)
> > @@ -2371,18 +2371,7 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  
> >  	if (unlikely(ret < 0))
> >  		goto out_unlock;
> > -	/*
> > -	 * Freezing in progress? We check after the page is marked dirty and
> > -	 * with page lock held so if the test here fails, we are sure freezing
> > -	 * code will wait during syncing until the page fault is done - at that
> > -	 * point page will be dirty and unlocked so freezing code will write it
> > -	 * and writeprotect it again.
> > -	 */
> >  	set_page_dirty(page);
> > -	if (inode->i_sb->s_frozen != SB_UNFROZEN) {
> > -		ret = -EAGAIN;
> > -		goto out_unlock;
> > -	}
> >  	wait_on_page_writeback(page);
> >  	return 0;
> >  out_unlock:
> > @@ -2397,12 +2386,9 @@ int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  	int ret;
> >  	struct super_block *sb = vma->vm_file->f_path.dentry->d_inode->i_sb;
> >  
> > -	/*
> > -	 * This check is racy but catches the common case. The check in
> > -	 * __block_page_mkwrite() is reliable.
> > -	 */
> > -	vfs_check_frozen(sb, SB_FREEZE_WRITE);
> > +	sb_start_write(sb, SB_FREEZE_WRITE);
> >  	ret = __block_page_mkwrite(vma, vmf, get_block);
> > +	sb_end_write(sb, SB_FREEZE_WRITE);
> >  	return block_page_mkwrite_return(ret);
> >  }
> >  EXPORT_SYMBOL(block_page_mkwrite);
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index c0018f2..471b9ae 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2529,7 +2529,7 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
> >  	count = ocount;
> >  	pos = *ppos;
> >  
> > -	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
> > +	sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
> >  
> >  	/* We can write back this queue in page reclaim */
> >  	current->backing_dev_info = mapping->backing_dev_info;
> > @@ -2601,6 +2601,7 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
> >  				pos, ppos, count, written);
> >  	}
> >  out:
> > +	sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
> >  	current->backing_dev_info = NULL;
> >  	return written ? written : err;
> >  }
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2012-02-06 15:33 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-20 20:34 [PATCH 0/8] Fix filesystem freezing Jan Kara
2012-01-20 20:34 ` [PATCH 1/8] fs: Improve filesystem freezing handling Jan Kara
2012-02-04  3:03   ` Eric Sandeen
2012-02-06 15:17     ` Jan Kara
2012-01-20 20:34 ` [PATCH 2/8] vfs: Protect write paths by sb_start_write - sb_end_write Jan Kara
2012-01-24  8:21   ` Dave Chinner
2012-01-24 11:44     ` Jan Kara
2012-02-05  6:13   ` Eric Sandeen
2012-02-06 15:33     ` Jan Kara
2012-01-20 20:34 ` [PATCH 3/8] ext4: Protect ext4_page_mkwrite & ext4_setattr with " Jan Kara
2012-01-20 20:34 ` [PATCH 4/8] xfs: Move ilock before transaction start in xfs_setattr_size() Jan Kara
2012-01-24  6:59   ` Dave Chinner
2012-01-24 11:52     ` Jan Kara
2012-01-20 20:34 ` [PATCH 5/8] xfs: Protect xfs_file_aio_write() & xfs_setattr_size() with sb_start_write - sb_end_write Jan Kara
2012-01-24  7:19   ` Dave Chinner
2012-01-24 19:35     ` Jan Kara
2012-02-04  4:30   ` Eric Sandeen
2012-02-04  4:50     ` Eric Sandeen
2012-02-05 23:11     ` Dave Chinner
2012-01-20 20:34 ` [PATCH 6/8] xfs: Use generic writers counter instead of m_active_trans counter Jan Kara
2012-01-24  8:05   ` Dave Chinner
2012-02-04  2:13     ` Eric Sandeen
2012-02-04  2:42   ` Eric Sandeen
2012-02-04  4:34   ` Eric Sandeen
2012-01-20 20:34 ` [PATCH 7/8] Documentation: Correct s_umount state for freeze_fs/unfreeze_fs Jan Kara
2012-01-20 20:34 ` [PATCH 8/8] vfs: Document s_frozen state through freeze_super Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).