linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] Fix filesystem freezing
@ 2012-01-12  1:20 Jan Kara
  2012-01-12  1:20 ` [PATCH 1/4] fs: Improve filesystem freezing handling Jan Kara
                   ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: Jan Kara @ 2012-01-12  1:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: LKML, linux-ext4, xfs, Eric Sandeen, Dave Chinner,
	Surbhi Palande, Kamal Mostafa, Christoph Hellwig


  Hello,

  filesystem freezing is currently racy and thus we can end up with dirty data
on frozen filesystem (see changelog of the first patch for detailed race
description and proposed fix). This patch series aims at fixing this. So far I
did basic testing using UML linux (running fsstress and bash-shared-mapping
while doing freeze - sync (to deadlock on leftover dirty pages) - unfreeze) and
I'm not able to trigger dirty pages on frozen filesystem anymore. I plan to do
more testing on real machine tomorrow. Surbhi, Kamal, you were able to
reproduce deadlocks due to dirty data as well so please test these patches out.
Thanks. Also review of patches is welcome.

								Honza

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/4] fs: Improve filesystem freezing handling
  2012-01-12  1:20 [PATCH 0/4] Fix filesystem freezing Jan Kara
@ 2012-01-12  1:20 ` Jan Kara
  2012-01-12 19:53   ` Andreas Dilger
                     ` (2 more replies)
  2012-01-12  1:20 ` [PATCH 2/4] vfs: Protect write paths by sb_start_write - sb_end_write Jan Kara
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 21+ messages in thread
From: Jan Kara @ 2012-01-12  1:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: LKML, linux-ext4, xfs, Eric Sandeen, Dave Chinner,
	Surbhi Palande, Kamal Mostafa, Christoph Hellwig, Jan Kara

Currently, exclusion between ->page_mkwrite() and filesystem freezing has been
handled by setting page dirty and then verifying s_frozen. This guaranteed that
either the freezing code sees the faulted page, writes it, and writeprotects it
again or we see s_frozen set and bail out of page fault. This works to protect
from page being marked writeable while filesystem freezing is running but has
an unpleasant artefact of leaving dirty (although unmodified and
writeprotected) pages on frozen filesystem. This artefact then requires
workarounds in writeback code and other places.

Also generally vfs_check_frozen() tests are racy since the filesystem can be
frozen just after the test is performed. Thus in other write paths we can
end up marking some pages or inodes dirty even though filesystem is already
frozen. Again this creates problems with flusher thread hanging on frozen
filesystem.

This patch aims at providing exclusion between write paths which dirty data (we
don't have to worry about metadata since that is handled by filesystems in
->freeze_fs) and filesystem freezing. We implement a writer-freeze read-write
semaphore in the superblock. Write paths which dirty data such as
->block_page_mkwrite() implementations, or ->aio_write() implementations hold
reader side of the semaphore.  Filesystem freezing code holds the writer side.
Only that we don't really want to bounce cachelines of the semaphore between
CPUs for each write happening. So we implement the reader side of the semaphore
as a per-cpu counter and the writer side is implemented using s_frozen
superblock field.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/super.c         |  121 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/fs.h |   14 ++++++
 2 files changed, 134 insertions(+), 1 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index afd0f1a..c85c64c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -32,12 +32,15 @@
 #include <linux/backing-dev.h>
 #include <linux/rculist_bl.h>
 #include <linux/cleancache.h>
+#include <linux/lockdep.h>
 #include "internal.h"
 
 
 LIST_HEAD(super_blocks);
 DEFINE_SPINLOCK(sb_lock);
 
+static struct lock_class_key sb_writers_key;
+
 /*
  * One thing we have to be careful of with a per-sb shrinker is that we don't
  * drop the last active reference to the superblock from within the shrinker.
@@ -183,6 +186,13 @@ static struct super_block *alloc_super(struct file_system_type *type)
 		s->s_shrink.seeks = DEFAULT_SEEKS;
 		s->s_shrink.shrink = prune_super;
 		s->s_shrink.batch = 1024;
+
+		init_waitqueue_head(&s->s_writers_wait);
+#ifdef CONFIG_SMP
+		s->s_page_faults = alloc_percpu(int);
+#endif
+		lockdep_init_map(&s->s_writers_lock_map, "sb_writers",
+				 &sb_writers_key, 0);
 	}
 out:
 	return s;
@@ -1126,6 +1136,84 @@ out:
 }
 
 /**
+ * sb_start_write - drop write access to a superblock
+ * @sb: the super we wrote to
+ *
+ * Decrement number of writers to the filesystem and wake up possible
+ * waiters wanting to freeze the filesystem.
+ */
+void sb_end_write(struct super_block *sb)
+{
+#ifdef CONFIG_SMP
+	this_cpu_dec(sb->s_writers);
+#else
+	preempt_disable();
+	sb->s_writers--;
+	preempt_enable();
+#endif
+	/*
+	 * Make sure s_writers are updated before we wake up waiters in
+	 * freeze_super().
+	 */
+	smp_mb();
+	if (waitqueue_active(&sb->s_writers_wait))
+		wake_up(&sb->s_writers_wait);
+	rwsem_release(&sb->s_writers_lock_map, 1, _RET_IP_);
+}
+
+/**
+ * sb_start_write - get write access to a superblock
+ * @sb: the super we write to
+ *
+ * When a process wants to write data to a filesystem (i.e. dirty a page),
+ * it should embed the operation in a sb_start_write() - sb_end_write() pair
+ * to get exclusion against filesystem freezing. This function increments
+ * number of writers to the filesystem and waits if filesystem is frozen until
+ * it is thawed.
+ */
+void sb_start_write(struct super_block *sb)
+{
+retry:
+	rwsem_acquire_read(&sb->s_writers_lock_map, 0, 0, _RET_IP_);
+	vfs_check_frozen(sb, SB_FREEZE_WRITE);
+#ifdef CONFIG_SMP
+	this_cpu_inc(sb->s_writers);
+#else
+	preempt_disable();
+	sb->s_writers++;
+	preempt_enable();
+#endif
+	/*
+	 * Make sure s_writers are updated before we check s_frozen.
+	 * freeze_super() first sets s_frozen and then checks s_writers.
+	 */
+	smp_mb();
+	if (sb->s_frozen != SB_UNFROZEN) {
+		sb_end_write(sb);
+		goto retry;
+	}
+}
+
+/*
+ * Get number of writers to the superblock
+ */
+static int get_writers_count(struct super_block *sb)
+{
+	int writers;
+#ifdef CONFIG_SMP
+	int cpu;
+
+	writers = 0;
+	for_each_possible_cpu(cpu) {
+		writers += *per_cpu_ptr(sb->s_writers, cpu);
+	}
+#else
+	writers = sb->s_writers;
+#endif
+	return writers;
+}
+
+/**
  * freeze_super - lock the filesystem and force it into a consistent state
  * @sb: the super to lock
  *
@@ -1136,6 +1224,7 @@ out:
 int freeze_super(struct super_block *sb)
 {
 	int ret;
+	int writers;
 
 	atomic_inc(&sb->s_active);
 	down_write(&sb->s_umount);
@@ -1151,8 +1240,36 @@ int freeze_super(struct super_block *sb)
 		return 0;
 	}
 
+	rwsem_acquire(&sb->s_writers_lock_map, 0, 0, _THIS_IP_);
 	sb->s_frozen = SB_FREEZE_WRITE;
-	smp_wmb();
+	/*
+	 * Now wait for all page faults to finish. ->page_mkwrite()
+	 * implementations must call vfs_check_frozen() before starting
+	 * a fault so that we cannot livelock here. Because of that we
+	 * are guaranteed that from this moment on new ->page_mkwrite()
+	 * calls will block and we just have to wait for s_page_faults
+	 * to drop to zero (in a sum).
+	 */
+	do {
+		DEFINE_WAIT(wait);
+
+		/*
+		 * We use a barrier in prepare_to_wait() to separate setting
+		 * of s_frozen and checking of s_writers
+		 */
+		prepare_to_wait(&sb->s_writers_wait, &wait,
+				TASK_UNINTERRUPTIBLE);
+		/*
+		 * We must iterate over all (even offline) CPUs because of CPU
+ 		 * hotplug their entries could still be non-zero. This is slow
+		 * when lots of CPUs are configured but hey, filesystem freezing
+		 * isn't exactly cheap anyway.
+		 */
+		writers = get_writers_count(sb);
+		if (writers)
+			schedule();
+		finish_wait(&sb->s_writers_wait, &wait);
+	} while (writers);
 
 	sync_filesystem(sb);
 
@@ -1165,6 +1282,7 @@ int freeze_super(struct super_block *sb)
 		if (ret) {
 			printk(KERN_ERR
 				"VFS:Filesystem freeze failed\n");
+			rwsem_release(&sb->s_writers_lock_map, 1, _THIS_IP_);
 			sb->s_frozen = SB_UNFROZEN;
 			deactivate_locked_super(sb);
 			return ret;
@@ -1206,6 +1324,7 @@ int thaw_super(struct super_block *sb)
 	}
 
 out:
+	rwsem_release(&sb->s_writers_lock_map, 1, _THIS_IP_);
 	sb->s_frozen = SB_UNFROZEN;
 	smp_wmb();
 	wake_up(&sb->s_wait_unfrozen);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e313022..297b263 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -10,6 +10,7 @@
 #include <linux/ioctl.h>
 #include <linux/blk_types.h>
 #include <linux/types.h>
+#include <linux/lockdep.h>
 
 /*
  * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
@@ -1445,6 +1446,16 @@ struct super_block {
 
 	int			s_frozen;
 	wait_queue_head_t	s_wait_unfrozen;
+#ifdef CONFIG_SMP
+	int __percpu 		*s_writers;	/* counter of running writes */
+#else
+	int			s_writers;	/* counter of running writes */
+#endif
+	wait_queue_head_t	s_writers_wait;	/* queue for waiting for
+						   writers to finish */
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map	s_writers_lock_map;
+#endif
 
 	char s_id[32];				/* Informational name */
 	u8 s_uuid[16];				/* UUID */
@@ -1501,6 +1512,9 @@ enum {
 #define vfs_check_frozen(sb, level) \
 	wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))
 
+void sb_end_write(struct super_block *sb);
+void sb_start_write(struct super_block *sb);
+
 /*
  * until VFS tracks user namespaces for inodes, just make all files
  * belong to init_user_ns
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/4] vfs: Protect write paths by sb_start_write - sb_end_write
  2012-01-12  1:20 [PATCH 0/4] Fix filesystem freezing Jan Kara
  2012-01-12  1:20 ` [PATCH 1/4] fs: Improve filesystem freezing handling Jan Kara
@ 2012-01-12  1:20 ` Jan Kara
  2012-01-12 19:56   ` Andreas Dilger
  2012-01-12  1:20 ` [PATCH 3/4] ext4: Protect ext4_page_mkwrite with " Jan Kara
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2012-01-12  1:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: LKML, linux-ext4, xfs, Eric Sandeen, Dave Chinner,
	Surbhi Palande, Kamal Mostafa, Christoph Hellwig, Jan Kara

There are three entry points which dirty pages in a filesystem.  mmap (handled
by block_page_mkwrite()), buffered write (handled by
__generic_file_aio_write()), and truncate (it can dirty last partial page -
handled by do_truncate()). Protect these places with sb_start_write() and
sb_end_write().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/buffer.c  |   18 ++----------------
 fs/open.c    |    6 ++++++
 mm/filemap.c |    3 ++-
 3 files changed, 10 insertions(+), 17 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 19d8eb7..8519405 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2371,18 +2371,7 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 
 	if (unlikely(ret < 0))
 		goto out_unlock;
-	/*
-	 * Freezing in progress? We check after the page is marked dirty and
-	 * with page lock held so if the test here fails, we are sure freezing
-	 * code will wait during syncing until the page fault is done - at that
-	 * point page will be dirty and unlocked so freezing code will write it
-	 * and writeprotect it again.
-	 */
 	set_page_dirty(page);
-	if (inode->i_sb->s_frozen != SB_UNFROZEN) {
-		ret = -EAGAIN;
-		goto out_unlock;
-	}
 	wait_on_page_writeback(page);
 	return 0;
 out_unlock:
@@ -2397,12 +2386,9 @@ int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 	int ret;
 	struct super_block *sb = vma->vm_file->f_path.dentry->d_inode->i_sb;
 
-	/*
-	 * This check is racy but catches the common case. The check in
-	 * __block_page_mkwrite() is reliable.
-	 */
-	vfs_check_frozen(sb, SB_FREEZE_WRITE);
+	sb_start_write(sb);
 	ret = __block_page_mkwrite(vma, vmf, get_block);
+	sb_end_write(sb);
 	return block_page_mkwrite_return(ret);
 }
 EXPORT_SYMBOL(block_page_mkwrite);
diff --git a/fs/open.c b/fs/open.c
index 22c41b5..ee17c90 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -55,8 +55,14 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
 	if (ret)
 		newattrs.ia_valid |= ret | ATTR_FORCE;
 
+	/*
+	 * Truncate can dirty last partial page so we need protection against
+	 * filesystem freezing.
+	 */
 	mutex_lock(&dentry->d_inode->i_mutex);
+	sb_start_write(dentry->d_sb);
 	ret = notify_change(dentry, &newattrs);
+	sb_end_write(dentry->d_sb);
 	mutex_unlock(&dentry->d_inode->i_mutex);
 	return ret;
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index c0018f2..6566c73 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2529,7 +2529,7 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	count = ocount;
 	pos = *ppos;
 
-	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+	sb_start_write(inode->i_sb);
 
 	/* We can write back this queue in page reclaim */
 	current->backing_dev_info = mapping->backing_dev_info;
@@ -2601,6 +2601,7 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 				pos, ppos, count, written);
 	}
 out:
+	sb_end_write(inode->i_sb);
 	current->backing_dev_info = NULL;
 	return written ? written : err;
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/4] ext4: Protect ext4_page_mkwrite with sb_start_write - sb_end_write
  2012-01-12  1:20 [PATCH 0/4] Fix filesystem freezing Jan Kara
  2012-01-12  1:20 ` [PATCH 1/4] fs: Improve filesystem freezing handling Jan Kara
  2012-01-12  1:20 ` [PATCH 2/4] vfs: Protect write paths by sb_start_write - sb_end_write Jan Kara
@ 2012-01-12  1:20 ` Jan Kara
  2012-01-12  1:20 ` [PATCH 4/4] xfs: Protect xfs_file_aio_write() " Jan Kara
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2012-01-12  1:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: LKML, linux-ext4, xfs, Eric Sandeen, Dave Chinner,
	Surbhi Palande, Kamal Mostafa, Christoph Hellwig, Jan Kara,
	Theodore Ts'o

Since ext4_page_mkwrite() calls into __block_page_mkwrite() it has to
provide freezing protection on it's own.

CC: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c |    7 ++-----
 1 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 848f436..fbe998b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4733,11 +4733,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	get_block_t *get_block;
 	int retries = 0;
 
-	/*
-	 * This check is racy but catches the common case. We rely on
-	 * __block_page_mkwrite() to do a reliable check.
-	 */
-	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+	sb_start_write(inode->i_sb);
 	/* Delalloc case is easy... */
 	if (test_opt(inode->i_sb, DELALLOC) &&
 	    !ext4_should_journal_data(inode) &&
@@ -4805,5 +4801,6 @@ retry_alloc:
 out_ret:
 	ret = block_page_mkwrite_return(ret);
 out:
+	sb_end_write(inode->i_sb);
 	return ret;
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 4/4] xfs: Protect xfs_file_aio_write() with sb_start_write - sb_end_write
  2012-01-12  1:20 [PATCH 0/4] Fix filesystem freezing Jan Kara
                   ` (2 preceding siblings ...)
  2012-01-12  1:20 ` [PATCH 3/4] ext4: Protect ext4_page_mkwrite with " Jan Kara
@ 2012-01-12  1:20 ` Jan Kara
  2012-01-12 21:29   ` Al Viro
  2012-01-12  2:48 ` [PATCH 0/4] Fix filesystem freezing Dave Chinner
  2012-01-12 20:48 ` Ted Ts'o
  5 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2012-01-12  1:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: LKML, linux-ext4, xfs, Eric Sandeen, Dave Chinner,
	Surbhi Palande, Kamal Mostafa, Christoph Hellwig, Jan Kara,
	Ben Myers, Alex Elder

Replace racy xfs_wait_for_freeze() check with reliable sb_start_write()
- sb_end_write() locking. Due to lock ranking dictated by the page fault
code we have to call sb_start_write() after we acquire ilock.

CC: Ben Myers <bpm@sgi.com>
CC: Alex Elder <elder@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/xfs/xfs_file.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 753ed9b..c5f879b 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -862,9 +862,11 @@ xfs_file_dio_aio_write(
 		*iolock = XFS_IOLOCK_SHARED;
 	}
 
+	sb_start_write(inode->i_sb);
 	trace_xfs_file_direct_write(ip, count, iocb->ki_pos, 0);
 	ret = generic_file_direct_write(iocb, iovp,
 			&nr_segs, pos, &iocb->ki_pos, count, ocount);
+	sb_end_write(inode->i_sb);
 
 	/* No fallback to buffered IO on errors for XFS. */
 	ASSERT(ret < 0 || ret == count);
@@ -899,6 +901,7 @@ xfs_file_buffered_aio_write(
 	/* We can write back this queue in page reclaim */
 	current->backing_dev_info = mapping->backing_dev_info;
 
+	sb_start_write(inode->i_sb);
 write_retry:
 	trace_xfs_file_buffered_write(ip, count, iocb->ki_pos, 0);
 	ret = generic_file_buffered_write(iocb, iovp, nr_segs,
@@ -914,6 +917,7 @@ write_retry:
 		enospc = 1;
 		goto write_retry;
 	}
+	sb_end_write(inode->i_sb);
 	current->backing_dev_info = NULL;
 	return ret;
 }
@@ -945,8 +949,6 @@ xfs_file_aio_write(
 	if (ocount == 0)
 		return 0;
 
-	xfs_wait_for_freeze(ip->i_mount, SB_FREEZE_WRITE);
-
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
 		return -EIO;
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/4] Fix filesystem freezing
  2012-01-12  1:20 [PATCH 0/4] Fix filesystem freezing Jan Kara
                   ` (3 preceding siblings ...)
  2012-01-12  1:20 ` [PATCH 4/4] xfs: Protect xfs_file_aio_write() " Jan Kara
@ 2012-01-12  2:48 ` Dave Chinner
  2012-01-12 11:30   ` Jan Kara
  2012-01-12 20:48 ` Ted Ts'o
  5 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2012-01-12  2:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, LKML, linux-ext4, xfs, Eric Sandeen, Dave Chinner,
	Surbhi Palande, Kamal Mostafa, Christoph Hellwig

On Thu, Jan 12, 2012 at 02:20:49AM +0100, Jan Kara wrote:
> 
>   Hello,
> 
>   filesystem freezing is currently racy and thus we can end up with dirty data
> on frozen filesystem (see changelog of the first patch for detailed race
> description and proposed fix). This patch series aims at fixing this.

It only fixes the dirty data race (i.e. SB_FREEZE_WRITE). The same
race conditions exist for SB_FREEZE_TRANS on XFS, and so need the
same fix. That race has had one previous attempt at fixing it in
XFS but that's not possible:

b2ce397 Revert "xfs: fix filesystsem freeze race in xfs_trans_alloc"
7a249cf xfs: fix filesystsem freeze race in xfs_trans_alloc

It was looking at that problem earlier today that lead to the
solution Eric proposed. Essentially the method in these patches
needs to replace the xfs specifc m_active_trans counter and delay
during ->fs_freeze to prevent that race condition....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/4] Fix filesystem freezing
  2012-01-12  2:48 ` [PATCH 0/4] Fix filesystem freezing Dave Chinner
@ 2012-01-12 11:30   ` Jan Kara
  2012-01-13  0:09     ` Dave Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2012-01-12 11:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-fsdevel, LKML, linux-ext4, xfs, Eric Sandeen,
	Dave Chinner, Surbhi Palande, Kamal Mostafa, Christoph Hellwig

On Thu 12-01-12 13:48:41, Dave Chinner wrote:
> On Thu, Jan 12, 2012 at 02:20:49AM +0100, Jan Kara wrote:
> > 
> >   Hello,
> > 
> >   filesystem freezing is currently racy and thus we can end up with dirty data
> > on frozen filesystem (see changelog of the first patch for detailed race
> > description and proposed fix). This patch series aims at fixing this.
> 
> It only fixes the dirty data race (i.e. SB_FREEZE_WRITE). The same
> race conditions exist for SB_FREEZE_TRANS on XFS, and so need the
> same fix. That race has had one previous attempt at fixing it in
> XFS but that's not possible:
> 
> b2ce397 Revert "xfs: fix filesystsem freeze race in xfs_trans_alloc"
> 7a249cf xfs: fix filesystsem freeze race in xfs_trans_alloc
> 
> It was looking at that problem earlier today that lead to the
> solution Eric proposed. Essentially the method in these patches
> needs to replace the xfs specifc m_active_trans counter and delay
> during ->fs_freeze to prevent that race condition....
  OK, I see. I just checked ext4 to make sure and ext4 seems to get this
right. Looking into Christoph's original patch it shouldn't be hard to fix
it. Instead of:
        atomic_inc(&mp->m_active_trans);
 
        if (wait_for_freeze)
              xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);

we just need to do a bit more elaborate

retry:
        if (wait_for_freeze)
              xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
        atomic_inc(&mp->m_active_trans);
	if (wait_for_freeze && mp->m_super->s_frozen >= SB_FREEZE_TRANS) {
        	atomic_dec(&mp->m_active_trans);
		goto retry;
	}

Or does XFS support nested transactions (i.e. a thread already holding a
running transaction can call into xfs_trans_alloc() again)?
That would make things more complicated...

Using sb_start_write() instead of m_active_trans won't be that easy because
it can create A-A deadlocks (e.g. we do sb_start_write in
block_page_mkwrite() and then xfs_get_blocks() decides to start a
transaction and calls sb_start_write() again which might block if
filesystem freezing started in the mean time).

So it's up to XFS maintainers to decide what's best but I'd take
Christoph's patch with above fixup. I guess I'll put it in this series and
see what people say.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/4] fs: Improve filesystem freezing handling
  2012-01-12  1:20 ` [PATCH 1/4] fs: Improve filesystem freezing handling Jan Kara
@ 2012-01-12 19:53   ` Andreas Dilger
  2012-01-12 20:07     ` Jan Kara
  2012-01-12 22:57   ` Eric Sandeen
  2012-01-13  1:26   ` Dave Chinner
  2 siblings, 1 reply; 21+ messages in thread
From: Andreas Dilger @ 2012-01-12 19:53 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, LKML, linux-ext4, xfs, Eric Sandeen, Dave Chinner,
	Surbhi Palande, Kamal Mostafa, Christoph Hellwig

On 2012-01-11, at 6:20 PM, Jan Kara wrote:
> Currently, exclusion between ->page_mkwrite() and filesystem freezing has been
> handled by setting page dirty and then verifying s_frozen. This guaranteed that
> either the freezing code sees the faulted page, writes it, and writeprotects it
> again or we see s_frozen set and bail out of page fault. This works to protect
> from page being marked writeable while filesystem freezing is running but has
> an unpleasant artefact of leaving dirty (although unmodified and
> writeprotected) pages on frozen filesystem. This artefact then requires
> workarounds in writeback code and other places.
> 
> Also generally vfs_check_frozen() tests are racy since the filesystem can be
> frozen just after the test is performed. Thus in other write paths we can
> end up marking some pages or inodes dirty even though filesystem is already
> frozen. Again this creates problems with flusher thread hanging on frozen
> filesystem.
> 
> This patch aims at providing exclusion between write paths which dirty data (we
> don't have to worry about metadata since that is handled by filesystems in
> ->freeze_fs) and filesystem freezing. We implement a writer-freeze read-write
> semaphore in the superblock. Write paths which dirty data such as
> ->block_page_mkwrite() implementations, or ->aio_write() implementations hold
> reader side of the semaphore.  Filesystem freezing code holds the writer side.
> Only that we don't really want to bounce cachelines of the semaphore between
> CPUs for each write happening. So we implement the reader side of the semaphore
> as a per-cpu counter and the writer side is implemented using s_frozen
> superblock field.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
> fs/super.c         |  121 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> include/linux/fs.h |   14 ++++++
> 2 files changed, 134 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index afd0f1a..c85c64c 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -32,12 +32,15 @@
> #include <linux/backing-dev.h>
> #include <linux/rculist_bl.h>
> #include <linux/cleancache.h>
> +#include <linux/lockdep.h>
> #include "internal.h"
> 
> 
> LIST_HEAD(super_blocks);
> DEFINE_SPINLOCK(sb_lock);
> 
> +static struct lock_class_key sb_writers_key;
> +
> /*
>  * One thing we have to be careful of with a per-sb shrinker is that we don't
>  * drop the last active reference to the superblock from within the shrinker.
> @@ -183,6 +186,13 @@ static struct super_block *alloc_super(struct file_system_type *type)
> 		s->s_shrink.seeks = DEFAULT_SEEKS;
> 		s->s_shrink.shrink = prune_super;
> 		s->s_shrink.batch = 1024;
> +
> +		init_waitqueue_head(&s->s_writers_wait);
> +#ifdef CONFIG_SMP
> +		s->s_page_faults = alloc_percpu(int);
> +#endif
> +		lockdep_init_map(&s->s_writers_lock_map, "sb_writers",
> +				 &sb_writers_key, 0);
> 	}
> out:
> 	return s;
> @@ -1126,6 +1136,84 @@ out:
> }
> 
> /**
> + * sb_start_write - drop write access to a superblock
> + * @sb: the super we wrote to
> + *
> + * Decrement number of writers to the filesystem and wake up possible
> + * waiters wanting to freeze the filesystem.
> + */
> +void sb_end_write(struct super_block *sb)
> +{
> +#ifdef CONFIG_SMP
> +	this_cpu_dec(sb->s_writers);
> +#else
> +	preempt_disable();
> +	sb->s_writers--;
> +	preempt_enable();
> +#endif
> +	/*
> +	 * Make sure s_writers are updated before we wake up waiters in
> +	 * freeze_super().
> +	 */
> +	smp_mb();
> +	if (waitqueue_active(&sb->s_writers_wait))
> +		wake_up(&sb->s_writers_wait);
> +	rwsem_release(&sb->s_writers_lock_map, 1, _RET_IP_);
> +}

Since this function is needed for calling __block_page_mkwrite(), which is
EXPORT_SYMBOL(), both sb_start_write() and sb_end_write() themselves need
to be EXPORT_SYMBOL().

> +/**
> + * sb_start_write - get write access to a superblock
> + * @sb: the super we write to
> + *
> + * When a process wants to write data to a filesystem (i.e. dirty a page),
> + * it should embed the operation in a sb_start_write() - sb_end_write() pair
> + * to get exclusion against filesystem freezing. This function increments
> + * number of writers to the filesystem and waits if filesystem is frozen until
> + * it is thawed.
> + */
> +void sb_start_write(struct super_block *sb)
> +{
> +retry:
> +	rwsem_acquire_read(&sb->s_writers_lock_map, 0, 0, _RET_IP_);
> +	vfs_check_frozen(sb, SB_FREEZE_WRITE);
> +#ifdef CONFIG_SMP
> +	this_cpu_inc(sb->s_writers);
> +#else
> +	preempt_disable();
> +	sb->s_writers++;
> +	preempt_enable();
> +#endif
> +	/*
> +	 * Make sure s_writers are updated before we check s_frozen.
> +	 * freeze_super() first sets s_frozen and then checks s_writers.
> +	 */
> +	smp_mb();
> +	if (sb->s_frozen != SB_UNFROZEN) {
> +		sb_end_write(sb);
> +		goto retry;
> +	}
> +}
> +
> +/*
> + * Get number of writers to the superblock
> + */
> +static int get_writers_count(struct super_block *sb)
> +{
> +	int writers;
> +#ifdef CONFIG_SMP
> +	int cpu;
> +
> +	writers = 0;
> +	for_each_possible_cpu(cpu) {
> +		writers += *per_cpu_ptr(sb->s_writers, cpu);
> +	}
> +#else
> +	writers = sb->s_writers;
> +#endif
> +	return writers;
> +}
> +
> +/**
>  * freeze_super - lock the filesystem and force it into a consistent state
>  * @sb: the super to lock
>  *
> @@ -1136,6 +1224,7 @@ out:
> int freeze_super(struct super_block *sb)
> {
> 	int ret;
> +	int writers;
> 
> 	atomic_inc(&sb->s_active);
> 	down_write(&sb->s_umount);
> @@ -1151,8 +1240,36 @@ int freeze_super(struct super_block *sb)
> 		return 0;
> 	}
> 
> +	rwsem_acquire(&sb->s_writers_lock_map, 0, 0, _THIS_IP_);
> 	sb->s_frozen = SB_FREEZE_WRITE;
> -	smp_wmb();
> +	/*
> +	 * Now wait for all page faults to finish. ->page_mkwrite()
> +	 * implementations must call vfs_check_frozen() before starting
> +	 * a fault so that we cannot livelock here. Because of that we
> +	 * are guaranteed that from this moment on new ->page_mkwrite()
> +	 * calls will block and we just have to wait for s_page_faults
> +	 * to drop to zero (in a sum).
> +	 */
> +	do {
> +		DEFINE_WAIT(wait);
> +
> +		/*
> +		 * We use a barrier in prepare_to_wait() to separate setting
> +		 * of s_frozen and checking of s_writers
> +		 */
> +		prepare_to_wait(&sb->s_writers_wait, &wait,
> +				TASK_UNINTERRUPTIBLE);
> +		/*
> +		 * We must iterate over all (even offline) CPUs because of CPU
> + 		 * hotplug their entries could still be non-zero. This is slow
> +		 * when lots of CPUs are configured but hey, filesystem freezing
> +		 * isn't exactly cheap anyway.
> +		 */
> +		writers = get_writers_count(sb);
> +		if (writers)
> +			schedule();
> +		finish_wait(&sb->s_writers_wait, &wait);
> +	} while (writers);
> 
> 	sync_filesystem(sb);
> 
> @@ -1165,6 +1282,7 @@ int freeze_super(struct super_block *sb)
> 		if (ret) {
> 			printk(KERN_ERR
> 				"VFS:Filesystem freeze failed\n");
> +			rwsem_release(&sb->s_writers_lock_map, 1, _THIS_IP_);
> 			sb->s_frozen = SB_UNFROZEN;
> 			deactivate_locked_super(sb);
> 			return ret;
> @@ -1206,6 +1324,7 @@ int thaw_super(struct super_block *sb)
> 	}
> 
> out:
> +	rwsem_release(&sb->s_writers_lock_map, 1, _THIS_IP_);
> 	sb->s_frozen = SB_UNFROZEN;
> 	smp_wmb();
> 	wake_up(&sb->s_wait_unfrozen);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e313022..297b263 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -10,6 +10,7 @@
> #include <linux/ioctl.h>
> #include <linux/blk_types.h>
> #include <linux/types.h>
> +#include <linux/lockdep.h>
> 
> /*
>  * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
> @@ -1445,6 +1446,16 @@ struct super_block {
> 
> 	int			s_frozen;
> 	wait_queue_head_t	s_wait_unfrozen;
> +#ifdef CONFIG_SMP
> +	int __percpu 		*s_writers;	/* counter of running writes */
> +#else
> +	int			s_writers;	/* counter of running writes */
> +#endif
> +	wait_queue_head_t	s_writers_wait;	/* queue for waiting for
> +						   writers to finish */
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> +	struct lockdep_map	s_writers_lock_map;
> +#endif
> 
> 	char s_id[32];				/* Informational name */
> 	u8 s_uuid[16];				/* UUID */
> @@ -1501,6 +1512,9 @@ enum {
> #define vfs_check_frozen(sb, level) \
> 	wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))
> 
> +void sb_end_write(struct super_block *sb);
> +void sb_start_write(struct super_block *sb);
> +
> /*
>  * until VFS tracks user namespaces for inodes, just make all files
>  * belong to init_user_ns
> -- 
> 1.7.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Cheers, Andreas






^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/4] vfs: Protect write paths by sb_start_write - sb_end_write
  2012-01-12  1:20 ` [PATCH 2/4] vfs: Protect write paths by sb_start_write - sb_end_write Jan Kara
@ 2012-01-12 19:56   ` Andreas Dilger
  2012-01-12 20:11     ` Jan Kara
  0 siblings, 1 reply; 21+ messages in thread
From: Andreas Dilger @ 2012-01-12 19:56 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, LKML, linux-ext4, xfs, Eric Sandeen, Dave Chinner,
	Surbhi Palande, Kamal Mostafa, Christoph Hellwig

On 2012-01-11, at 6:20 PM, Jan Kara wrote:
> There are three entry points which dirty pages in a filesystem.  mmap (handled
> by block_page_mkwrite()), buffered write (handled by
> __generic_file_aio_write()), and truncate (it can dirty last partial page -
> handled by do_truncate()). Protect these places with sb_start_write() and
> sb_end_write().
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
> fs/buffer.c  |   18 ++----------------
> fs/open.c    |    6 ++++++
> mm/filemap.c |    3 ++-
> 3 files changed, 10 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 19d8eb7..8519405 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2371,18 +2371,7 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,

The comment for __block_page_mkwrite() needs to be updated to reference
sb_start_write() and sb_end_write() instead of vfs_check_frozen().

Cheers, Andreas

> 	if (unlikely(ret < 0))
> 		goto out_unlock;
> -	/*
> -	 * Freezing in progress? We check after the page is marked dirty and
> -	 * with page lock held so if the test here fails, we are sure freezing
> -	 * code will wait during syncing until the page fault is done - at that
> -	 * point page will be dirty and unlocked so freezing code will write it
> -	 * and writeprotect it again.
> -	 */
> 	set_page_dirty(page);
> -	if (inode->i_sb->s_frozen != SB_UNFROZEN) {
> -		ret = -EAGAIN;
> -		goto out_unlock;
> -	}
> 	wait_on_page_writeback(page);
> 	return 0;
> out_unlock:
> @@ -2397,12 +2386,9 @@ int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> 	int ret;
> 	struct super_block *sb = vma->vm_file->f_path.dentry->d_inode->i_sb;
> 
> -	/*
> -	 * This check is racy but catches the common case. The check in
> -	 * __block_page_mkwrite() is reliable.
> -	 */
> -	vfs_check_frozen(sb, SB_FREEZE_WRITE);
> +	sb_start_write(sb);
> 	ret = __block_page_mkwrite(vma, vmf, get_block);
> +	sb_end_write(sb);
> 	return block_page_mkwrite_return(ret);
> }
> EXPORT_SYMBOL(block_page_mkwrite);
> diff --git a/fs/open.c b/fs/open.c
> index 22c41b5..ee17c90 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -55,8 +55,14 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
> 	if (ret)
> 		newattrs.ia_valid |= ret | ATTR_FORCE;
> 
> +	/*
> +	 * Truncate can dirty last partial page so we need protection against
> +	 * filesystem freezing.
> +	 */
> 	mutex_lock(&dentry->d_inode->i_mutex);
> +	sb_start_write(dentry->d_sb);
> 	ret = notify_change(dentry, &newattrs);
> +	sb_end_write(dentry->d_sb);
> 	mutex_unlock(&dentry->d_inode->i_mutex);
> 	return ret;
> }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c0018f2..6566c73 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2529,7 +2529,7 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
> 	count = ocount;
> 	pos = *ppos;
> 
> -	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
> +	sb_start_write(inode->i_sb);
> 
> 	/* We can write back this queue in page reclaim */
> 	current->backing_dev_info = mapping->backing_dev_info;
> @@ -2601,6 +2601,7 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
> 				pos, ppos, count, written);
> 	}
> out:
> +	sb_end_write(inode->i_sb);
> 	current->backing_dev_info = NULL;
> 	return written ? written : err;
> }
> -- 
> 1.7.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/4] fs: Improve filesystem freezing handling
  2012-01-12 19:53   ` Andreas Dilger
@ 2012-01-12 20:07     ` Jan Kara
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2012-01-12 20:07 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Jan Kara, linux-fsdevel, LKML, linux-ext4, xfs, Eric Sandeen,
	Dave Chinner, Surbhi Palande, Kamal Mostafa, Christoph Hellwig

On Thu 12-01-12 12:53:35, Andreas Dilger wrote:
> On 2012-01-11, at 6:20 PM, Jan Kara wrote:
> > /**
> > + * sb_start_write - drop write access to a superblock
> > + * @sb: the super we wrote to
> > + *
> > + * Decrement number of writers to the filesystem and wake up possible
> > + * waiters wanting to freeze the filesystem.
> > + */
> > +void sb_end_write(struct super_block *sb)
> > +{
> > +#ifdef CONFIG_SMP
> > +	this_cpu_dec(sb->s_writers);
> > +#else
> > +	preempt_disable();
> > +	sb->s_writers--;
> > +	preempt_enable();
> > +#endif
> > +	/*
> > +	 * Make sure s_writers are updated before we wake up waiters in
> > +	 * freeze_super().
> > +	 */
> > +	smp_mb();
> > +	if (waitqueue_active(&sb->s_writers_wait))
> > +		wake_up(&sb->s_writers_wait);
> > +	rwsem_release(&sb->s_writers_lock_map, 1, _RET_IP_);
> > +}
> 
> Since this function is needed for calling __block_page_mkwrite(), which is
> EXPORT_SYMBOL(), both sb_start_write() and sb_end_write() themselves need
> to be EXPORT_SYMBOL().
  Good point. Fixed. Thanks.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/4] vfs: Protect write paths by sb_start_write - sb_end_write
  2012-01-12 19:56   ` Andreas Dilger
@ 2012-01-12 20:11     ` Jan Kara
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2012-01-12 20:11 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Jan Kara, linux-fsdevel, LKML, linux-ext4, xfs, Eric Sandeen,
	Dave Chinner, Surbhi Palande, Kamal Mostafa, Christoph Hellwig

On Thu 12-01-12 12:56:01, Andreas Dilger wrote:
> On 2012-01-11, at 6:20 PM, Jan Kara wrote:
> > There are three entry points which dirty pages in a filesystem.  mmap (handled
> > by block_page_mkwrite()), buffered write (handled by
> > __generic_file_aio_write()), and truncate (it can dirty last partial page -
> > handled by do_truncate()). Protect these places with sb_start_write() and
> > sb_end_write().
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> > fs/buffer.c  |   18 ++----------------
> > fs/open.c    |    6 ++++++
> > mm/filemap.c |    3 ++-
> > 3 files changed, 10 insertions(+), 17 deletions(-)
> > 
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index 19d8eb7..8519405 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -2371,18 +2371,7 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> 
> The comment for __block_page_mkwrite() needs to be updated to reference
> sb_start_write() and sb_end_write() instead of vfs_check_frozen().
  Thanks. Fixed.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/4] Fix filesystem freezing
  2012-01-12  1:20 [PATCH 0/4] Fix filesystem freezing Jan Kara
                   ` (4 preceding siblings ...)
  2012-01-12  2:48 ` [PATCH 0/4] Fix filesystem freezing Dave Chinner
@ 2012-01-12 20:48 ` Ted Ts'o
  2012-01-12 21:38   ` Jan Kara
  5 siblings, 1 reply; 21+ messages in thread
From: Ted Ts'o @ 2012-01-12 20:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, LKML, linux-ext4, xfs, Eric Sandeen, Dave Chinner,
	Surbhi Palande, Kamal Mostafa, Christoph Hellwig

On Thu, Jan 12, 2012 at 02:20:49AM +0100, Jan Kara wrote:
> reproduce deadlocks due to dirty data as well so please test these patches out.
> Thanks. Also review of patches is welcome.

This patch series looks good to me (modulo issues others have already
pointed out).

Acked-by: "Theodore Ts'o" <tytso@mit.edu>

I assume you're going to send this up through Al's vfs tree?

  	 	      	      	      	      - Ted

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/4] xfs: Protect xfs_file_aio_write() with sb_start_write - sb_end_write
  2012-01-12  1:20 ` [PATCH 4/4] xfs: Protect xfs_file_aio_write() " Jan Kara
@ 2012-01-12 21:29   ` Al Viro
  2012-01-12 21:36     ` Jan Kara
  0 siblings, 1 reply; 21+ messages in thread
From: Al Viro @ 2012-01-12 21:29 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, LKML, linux-ext4, xfs, Eric Sandeen, Dave Chinner,
	Surbhi Palande, Kamal Mostafa, Christoph Hellwig, Ben Myers,
	Alex Elder

On Thu, Jan 12, 2012 at 02:20:53AM +0100, Jan Kara wrote:
> Replace racy xfs_wait_for_freeze() check with reliable sb_start_write()
> - sb_end_write() locking. Due to lock ranking dictated by the page fault
> code we have to call sb_start_write() after we acquire ilock.
> 
> CC: Ben Myers <bpm@sgi.com>
> CC: Alex Elder <elder@kernel.org>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/xfs/xfs_file.c |    6 ++++--
>  1 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 753ed9b..c5f879b 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -862,9 +862,11 @@ xfs_file_dio_aio_write(
>  		*iolock = XFS_IOLOCK_SHARED;
>  	}
>  
> +	sb_start_write(inode->i_sb);
>  	trace_xfs_file_direct_write(ip, count, iocb->ki_pos, 0);
>  	ret = generic_file_direct_write(iocb, iovp,
>  			&nr_segs, pos, &iocb->ki_pos, count, ocount);
> +	sb_end_write(inode->i_sb);

What lock ordering do you have in mind?  Explicit description in fs/super.c,
please...

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/4] xfs: Protect xfs_file_aio_write() with sb_start_write - sb_end_write
  2012-01-12 21:29   ` Al Viro
@ 2012-01-12 21:36     ` Jan Kara
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2012-01-12 21:36 UTC (permalink / raw)
  To: Al Viro
  Cc: Jan Kara, linux-fsdevel, LKML, linux-ext4, xfs, Eric Sandeen,
	Dave Chinner, Surbhi Palande, Kamal Mostafa, Christoph Hellwig,
	Ben Myers, Alex Elder

On Thu 12-01-12 21:29:41, Al Viro wrote:
> On Thu, Jan 12, 2012 at 02:20:53AM +0100, Jan Kara wrote:
> > Replace racy xfs_wait_for_freeze() check with reliable sb_start_write()
> > - sb_end_write() locking. Due to lock ranking dictated by the page fault
> > code we have to call sb_start_write() after we acquire ilock.
> > 
> > CC: Ben Myers <bpm@sgi.com>
> > CC: Alex Elder <elder@kernel.org>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/xfs/xfs_file.c |    6 ++++--
> >  1 files changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 753ed9b..c5f879b 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -862,9 +862,11 @@ xfs_file_dio_aio_write(
> >  		*iolock = XFS_IOLOCK_SHARED;
> >  	}
> >  
> > +	sb_start_write(inode->i_sb);
> >  	trace_xfs_file_direct_write(ip, count, iocb->ki_pos, 0);
> >  	ret = generic_file_direct_write(iocb, iovp,
> >  			&nr_segs, pos, &iocb->ki_pos, count, ocount);
> > +	sb_end_write(inode->i_sb);
> 
> What lock ordering do you have in mind?  Explicit description in fs/super.c,
> please...
  Good point. Will add explanatory comment. Thanks.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/4] Fix filesystem freezing
  2012-01-12 20:48 ` Ted Ts'o
@ 2012-01-12 21:38   ` Jan Kara
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2012-01-12 21:38 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Jan Kara, linux-fsdevel, LKML, linux-ext4, xfs, Eric Sandeen,
	Dave Chinner, Surbhi Palande, Kamal Mostafa, Christoph Hellwig

On Thu 12-01-12 15:48:26, Ted Tso wrote:
> On Thu, Jan 12, 2012 at 02:20:49AM +0100, Jan Kara wrote:
> > reproduce deadlocks due to dirty data as well so please test these patches out.
> > Thanks. Also review of patches is welcome.
> 
> This patch series looks good to me (modulo issues others have already
> pointed out).
> 
> Acked-by: "Theodore Ts'o" <tytso@mit.edu>
  Thanks.

> I assume you're going to send this up through Al's vfs tree?
  Yes.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/4] fs: Improve filesystem freezing handling
  2012-01-12  1:20 ` [PATCH 1/4] fs: Improve filesystem freezing handling Jan Kara
  2012-01-12 19:53   ` Andreas Dilger
@ 2012-01-12 22:57   ` Eric Sandeen
  2012-01-12 23:15     ` Jan Kara
  2012-01-13  1:26   ` Dave Chinner
  2 siblings, 1 reply; 21+ messages in thread
From: Eric Sandeen @ 2012-01-12 22:57 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, LKML, linux-ext4, xfs, Dave Chinner,
	Surbhi Palande, Kamal Mostafa, Christoph Hellwig

On 1/11/12 7:20 PM, Jan Kara wrote:
> Currently, exclusion between ->page_mkwrite() and filesystem freezing has been
> handled by setting page dirty and then verifying s_frozen. This guaranteed that
> either the freezing code sees the faulted page, writes it, and writeprotects it
> again or we see s_frozen set and bail out of page fault. This works to protect
> from page being marked writeable while filesystem freezing is running but has
> an unpleasant artefact of leaving dirty (although unmodified and
> writeprotected) pages on frozen filesystem. This artefact then requires
> workarounds in writeback code and other places.
> 
> Also generally vfs_check_frozen() tests are racy since the filesystem can be
> frozen just after the test is performed. Thus in other write paths we can
> end up marking some pages or inodes dirty even though filesystem is already
> frozen. Again this creates problems with flusher thread hanging on frozen
> filesystem.
> 
> This patch aims at providing exclusion between write paths which dirty data (we
> don't have to worry about metadata since that is handled by filesystems in
> ->freeze_fs) and filesystem freezing. We implement a writer-freeze read-write
> semaphore in the superblock. Write paths which dirty data such as
> ->block_page_mkwrite() implementations, or ->aio_write() implementations hold
> reader side of the semaphore.  Filesystem freezing code holds the writer side.
> Only that we don't really want to bounce cachelines of the semaphore between
> CPUs for each write happening. So we implement the reader side of the semaphore
> as a per-cpu counter and the writer side is implemented using s_frozen
> superblock field.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/super.c         |  121 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/fs.h |   14 ++++++
>  2 files changed, 134 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index afd0f1a..c85c64c 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -32,12 +32,15 @@
>  #include <linux/backing-dev.h>
>  #include <linux/rculist_bl.h>
>  #include <linux/cleancache.h>
> +#include <linux/lockdep.h>
>  #include "internal.h"
>  
>  
>  LIST_HEAD(super_blocks);
>  DEFINE_SPINLOCK(sb_lock);
>  
> +static struct lock_class_key sb_writers_key;
> +
>  /*
>   * One thing we have to be careful of with a per-sb shrinker is that we don't
>   * drop the last active reference to the superblock from within the shrinker.
> @@ -183,6 +186,13 @@ static struct super_block *alloc_super(struct file_system_type *type)
>  		s->s_shrink.seeks = DEFAULT_SEEKS;
>  		s->s_shrink.shrink = prune_super;
>  		s->s_shrink.batch = 1024;
> +
> +		init_waitqueue_head(&s->s_writers_wait);
> +#ifdef CONFIG_SMP
> +		s->s_page_faults = alloc_percpu(int);

isn't this s->s_writers?  s->s_page_faults isn't defined anywhere.

> +#endif
> +		lockdep_init_map(&s->s_writers_lock_map, "sb_writers",
> +				 &sb_writers_key, 0);
>  	}
>  out:
>  	return s;
> @@ -1126,6 +1136,84 @@ out:
>  }
>  
>  /**
> + * sb_start_write - drop write access to a superblock
      ^^^^^^^^^^^^^^

s/b sb_end_write

> + * @sb: the super we wrote to
> + *
> + * Decrement number of writers to the filesystem and wake up possible
> + * waiters wanting to freeze the filesystem.
> + */
> +void sb_end_write(struct super_block *sb)
> +{
> +#ifdef CONFIG_SMP
> +	this_cpu_dec(sb->s_writers);
> +#else
> +	preempt_disable();
> +	sb->s_writers--;
> +	preempt_enable();
> +#endif
> +	/*
> +	 * Make sure s_writers are updated before we wake up waiters in
> +	 * freeze_super().
> +	 */
> +	smp_mb();
> +	if (waitqueue_active(&sb->s_writers_wait))
> +		wake_up(&sb->s_writers_wait);
> +	rwsem_release(&sb->s_writers_lock_map, 1, _RET_IP_);
> +}
> +
> +/**
> + * sb_start_write - get write access to a superblock
> + * @sb: the super we write to
> + *
> + * When a process wants to write data to a filesystem (i.e. dirty a page),
> + * it should embed the operation in a sb_start_write() - sb_end_write() pair
> + * to get exclusion against filesystem freezing. This function increments
> + * number of writers to the filesystem and waits if filesystem is frozen until
> + * it is thawed.
> + */
> +void sb_start_write(struct super_block *sb)
> +{
> +retry:
> +	rwsem_acquire_read(&sb->s_writers_lock_map, 0, 0, _RET_IP_);
> +	vfs_check_frozen(sb, SB_FREEZE_WRITE);
> +#ifdef CONFIG_SMP
> +	this_cpu_inc(sb->s_writers);
> +#else
> +	preempt_disable();
> +	sb->s_writers++;
> +	preempt_enable();
> +#endif
> +	/*
> +	 * Make sure s_writers are updated before we check s_frozen.
> +	 * freeze_super() first sets s_frozen and then checks s_writers.
> +	 */
> +	smp_mb();
> +	if (sb->s_frozen != SB_UNFROZEN) {
> +		sb_end_write(sb);
> +		goto retry;
> +	}
> +}
> +
> +/*
> + * Get number of writers to the superblock
> + */
> +static int get_writers_count(struct super_block *sb)
> +{
> +	int writers;
> +#ifdef CONFIG_SMP
> +	int cpu;
> +
> +	writers = 0;
> +	for_each_possible_cpu(cpu) {
> +		writers += *per_cpu_ptr(sb->s_writers, cpu);
> +	}
> +#else
> +	writers = sb->s_writers;
> +#endif
> +	return writers;
> +}
> +
> +/**
>   * freeze_super - lock the filesystem and force it into a consistent state
>   * @sb: the super to lock
>   *
> @@ -1136,6 +1224,7 @@ out:
>  int freeze_super(struct super_block *sb)
>  {
>  	int ret;
> +	int writers;
>  
>  	atomic_inc(&sb->s_active);
>  	down_write(&sb->s_umount);
> @@ -1151,8 +1240,36 @@ int freeze_super(struct super_block *sb)
>  		return 0;
>  	}
>  
> +	rwsem_acquire(&sb->s_writers_lock_map, 0, 0, _THIS_IP_);
>  	sb->s_frozen = SB_FREEZE_WRITE;
> -	smp_wmb();
> +	/*
> +	 * Now wait for all page faults to finish. ->page_mkwrite()
> +	 * implementations must call vfs_check_frozen() before starting
> +	 * a fault so that we cannot livelock here. Because of that we
> +	 * are guaranteed that from this moment on new ->page_mkwrite()
> +	 * calls will block and we just have to wait for s_page_faults

wait for s_writers, right?

> +	 * to drop to zero (in a sum).
> +	 */
> +	do {
> +		DEFINE_WAIT(wait);
> +
> +		/*
> +		 * We use a barrier in prepare_to_wait() to separate setting
> +		 * of s_frozen and checking of s_writers
> +		 */
> +		prepare_to_wait(&sb->s_writers_wait, &wait,
> +				TASK_UNINTERRUPTIBLE);
> +		/*
> +		 * We must iterate over all (even offline) CPUs because of CPU
> + 		 * hotplug their entries could still be non-zero. This is slow
> +		 * when lots of CPUs are configured but hey, filesystem freezing
> +		 * isn't exactly cheap anyway.
> +		 */
> +		writers = get_writers_count(sb);
> +		if (writers)
> +			schedule();
> +		finish_wait(&sb->s_writers_wait, &wait);
> +	} while (writers);
>  
>  	sync_filesystem(sb);
>  
> @@ -1165,6 +1282,7 @@ int freeze_super(struct super_block *sb)
>  		if (ret) {
>  			printk(KERN_ERR
>  				"VFS:Filesystem freeze failed\n");
> +			rwsem_release(&sb->s_writers_lock_map, 1, _THIS_IP_);
>  			sb->s_frozen = SB_UNFROZEN;
>  			deactivate_locked_super(sb);
>  			return ret;
> @@ -1206,6 +1324,7 @@ int thaw_super(struct super_block *sb)
>  	}
>  
>  out:
> +	rwsem_release(&sb->s_writers_lock_map, 1, _THIS_IP_);
>  	sb->s_frozen = SB_UNFROZEN;
>  	smp_wmb();
>  	wake_up(&sb->s_wait_unfrozen);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e313022..297b263 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -10,6 +10,7 @@
>  #include <linux/ioctl.h>
>  #include <linux/blk_types.h>
>  #include <linux/types.h>
> +#include <linux/lockdep.h>
>  
>  /*
>   * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
> @@ -1445,6 +1446,16 @@ struct super_block {
>  
>  	int			s_frozen;
>  	wait_queue_head_t	s_wait_unfrozen;
> +#ifdef CONFIG_SMP
> +	int __percpu 		*s_writers;	/* counter of running writes */
> +#else
> +	int			s_writers;	/* counter of running writes */
> +#endif
> +	wait_queue_head_t	s_writers_wait;	/* queue for waiting for
> +						   writers to finish */
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> +	struct lockdep_map	s_writers_lock_map;
> +#endif
>  
>  	char s_id[32];				/* Informational name */
>  	u8 s_uuid[16];				/* UUID */
> @@ -1501,6 +1512,9 @@ enum {
>  #define vfs_check_frozen(sb, level) \
>  	wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))
>  
> +void sb_end_write(struct super_block *sb);
> +void sb_start_write(struct super_block *sb);
> +
>  /*
>   * until VFS tracks user namespaces for inodes, just make all files
>   * belong to init_user_ns


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/4] fs: Improve filesystem freezing handling
  2012-01-12 22:57   ` Eric Sandeen
@ 2012-01-12 23:15     ` Jan Kara
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2012-01-12 23:15 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jan Kara, linux-fsdevel, LKML, linux-ext4, xfs, Dave Chinner,
	Surbhi Palande, Kamal Mostafa, Christoph Hellwig

On Thu 12-01-12 16:57:42, Eric Sandeen wrote:
> On 1/11/12 7:20 PM, Jan Kara wrote:
> > @@ -183,6 +186,13 @@ static struct super_block *alloc_super(struct file_system_type *type)
> >  		s->s_shrink.seeks = DEFAULT_SEEKS;
> >  		s->s_shrink.shrink = prune_super;
> >  		s->s_shrink.batch = 1024;
> > +
> > +		init_waitqueue_head(&s->s_writers_wait);
> > +#ifdef CONFIG_SMP
> > +		s->s_page_faults = alloc_percpu(int);
> 
> isn't this s->s_writers?  s->s_page_faults isn't defined anywhere.
  Right. Leftover from original implementation and since I was doing
initial testing only using UML, I didn't spot this. Thanks.

> > +#endif
> > +		lockdep_init_map(&s->s_writers_lock_map, "sb_writers",
> > +				 &sb_writers_key, 0);
> >  	}
> >  out:
> >  	return s;
> > @@ -1126,6 +1136,84 @@ out:
> >  }
> >  
> >  /**
> > + * sb_start_write - drop write access to a superblock
>       ^^^^^^^^^^^^^^
> 
> s/b sb_end_write
  Fixed.

> > @@ -1136,6 +1224,7 @@ out:
> >  int freeze_super(struct super_block *sb)
> >  {
> >  	int ret;
> > +	int writers;
> >  
> >  	atomic_inc(&sb->s_active);
> >  	down_write(&sb->s_umount);
> > @@ -1151,8 +1240,36 @@ int freeze_super(struct super_block *sb)
> >  		return 0;
> >  	}
> >  
> > +	rwsem_acquire(&sb->s_writers_lock_map, 0, 0, _THIS_IP_);
> >  	sb->s_frozen = SB_FREEZE_WRITE;
> > -	smp_wmb();
> > +	/*
> > +	 * Now wait for all page faults to finish. ->page_mkwrite()
> > +	 * implementations must call vfs_check_frozen() before starting
> > +	 * a fault so that we cannot livelock here. Because of that we
> > +	 * are guaranteed that from this moment on new ->page_mkwrite()
> > +	 * calls will block and we just have to wait for s_page_faults
> 
> wait for s_writers, right?
  Yes. Fixed.

  Thanks for review.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/4] Fix filesystem freezing
  2012-01-12 11:30   ` Jan Kara
@ 2012-01-13  0:09     ` Dave Chinner
  2012-01-13 11:07       ` Jan Kara
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2012-01-13  0:09 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, LKML, linux-ext4, xfs, Eric Sandeen, Dave Chinner,
	Surbhi Palande, Kamal Mostafa, Christoph Hellwig

On Thu, Jan 12, 2012 at 12:30:31PM +0100, Jan Kara wrote:
> On Thu 12-01-12 13:48:41, Dave Chinner wrote:
> > On Thu, Jan 12, 2012 at 02:20:49AM +0100, Jan Kara wrote:
> > > 
> > >   Hello,
> > > 
> > >   filesystem freezing is currently racy and thus we can end up with dirty data
> > > on frozen filesystem (see changelog of the first patch for detailed race
> > > description and proposed fix). This patch series aims at fixing this.
> > 
> > It only fixes the dirty data race (i.e. SB_FREEZE_WRITE). The same
> > race conditions exist for SB_FREEZE_TRANS on XFS, and so need the
> > same fix. That race has had one previous attempt at fixing it in
> > XFS but that's not possible:
> > 
> > b2ce397 Revert "xfs: fix filesystsem freeze race in xfs_trans_alloc"
> > 7a249cf xfs: fix filesystsem freeze race in xfs_trans_alloc
> > 
> > It was looking at that problem earlier today that lead to the
> > solution Eric proposed. Essentially the method in these patches
> > needs to replace the xfs specifc m_active_trans counter and delay
> > during ->fs_freeze to prevent that race condition....
>   OK, I see. I just checked ext4 to make sure and ext4 seems to get this
> right. Looking into Christoph's original patch it shouldn't be hard to fix
> it. Instead of:
>         atomic_inc(&mp->m_active_trans);
>  
>         if (wait_for_freeze)
>               xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> 
> we just need to do a bit more elaborate
> 
> retry:
>         if (wait_for_freeze)
>               xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
>         atomic_inc(&mp->m_active_trans);
> 	if (wait_for_freeze && mp->m_super->s_frozen >= SB_FREEZE_TRANS) {
>         	atomic_dec(&mp->m_active_trans);
> 		goto retry;
> 	}
> 
> Or does XFS support nested transactions (i.e. a thread already holding a
> running transaction can call into xfs_trans_alloc() again)?
> That would make things more complicated...

You're still missing the point - that this isn't an XFS specific
problem or that the write problem is a ext4 specific problem. The
problem is that these are freeze state transition problems -
something that can affect every filesystem because the freeze code
is generic.  Quite frankly, I'm not interested in having a generic
solution for SB_FREEZE_WRITE and a custom, per filesystem solution
for SB_FREEZE_TRANS when the solution is exactly the same.

> Using sb_start_write() instead of m_active_trans won't be that easy because
> it can create A-A deadlocks (e.g. we do sb_start_write in
> block_page_mkwrite() and then xfs_get_blocks() decides to start a
> transaction and calls sb_start_write() again which might block if
> filesystem freezing started in the mean time).

So, like Eric said in his first email, it's not a "write start/end"
interface that is needed, the interface has to work with different
freeze levels (e.g "sb_freeze_ref(sb, level)/sb_freeze_drop(sb,
level)").  Sure, internally it would have to map to two counters and
different level checks, but it solves the same problem for all
levels of freeze for all filesystems.

Let's fix this freeze problem once and for all in the generic code,
and not have to keep coming back to it to add more functioanlity for
different situations the most recent fix didn't handle for random
filesystem X....

> So it's up to XFS maintainers to decide what's best but I'd take
> Christoph's patch with above fixup. I guess I'll put it in this series and
> see what people say.

Eric and I have already discussed and agreed to replacing the XFS
sepcific code with the fixed VFS level API where other XFS
developers including the "XFS Maintainers" (*) can see. Nobody has
objected so I doubt there's any problem with doing so.

Besides, anything that replaces custom XFS code with a better
generic solution is pretty much guaranteed to be done. And given
that this is not an XFS specifc problem and it needs be fixed at
the VFS level.....

Cheers,

Dave.

[*] keep in mind that "XFS Maintainer" is just a figurehead who
maintains the tree that is sent to Linus, not the person with final
say over what changes are made. That decision is made by the
reviewers of the code...

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/4] fs: Improve filesystem freezing handling
  2012-01-12  1:20 ` [PATCH 1/4] fs: Improve filesystem freezing handling Jan Kara
  2012-01-12 19:53   ` Andreas Dilger
  2012-01-12 22:57   ` Eric Sandeen
@ 2012-01-13  1:26   ` Dave Chinner
  2012-01-13 10:12     ` Jan Kara
  2 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2012-01-13  1:26 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Surbhi Palande, Kamal Mostafa, Eric Sandeen, LKML,
	xfs, Christoph Hellwig, Dave Chinner, linux-ext4

On Thu, Jan 12, 2012 at 02:20:50AM +0100, Jan Kara wrote:
> + *
> + * Decrement number of writers to the filesystem and wake up possible
> + * waiters wanting to freeze the filesystem.
> + */
> +void sb_end_write(struct super_block *sb)
> +{
> +#ifdef CONFIG_SMP
> +	this_cpu_dec(sb->s_writers);
> +#else
> +	preempt_disable();
> +	sb->s_writers--;
> +	preempt_enable();
> +#endif

I really dislike this type of open coded per-cpu counter
implementation. I can't see that there is no good reason to use it
over percpu_counters here which abstract all this mess away.

i.e. it is relatively rare that the per-cpu count will nest
greater than the percpu_counter batch size (needs more than 32
concurrent blocked active writes per CPU), so there is no
significant overhead to using the percpu_counters here.

Indeed, if there are that many blocked writes per CPU, then the
overhead of an occasional global counter update is going to be lost
in the noise of everything else that is going on.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/4] fs: Improve filesystem freezing handling
  2012-01-13  1:26   ` Dave Chinner
@ 2012-01-13 10:12     ` Jan Kara
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2012-01-13 10:12 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-fsdevel, Surbhi Palande, Kamal Mostafa,
	Eric Sandeen, LKML, xfs, Christoph Hellwig, Dave Chinner,
	linux-ext4

On Fri 13-01-12 12:26:43, Dave Chinner wrote:
> On Thu, Jan 12, 2012 at 02:20:50AM +0100, Jan Kara wrote:
> > + *
> > + * Decrement number of writers to the filesystem and wake up possible
> > + * waiters wanting to freeze the filesystem.
> > + */
> > +void sb_end_write(struct super_block *sb)
> > +{
> > +#ifdef CONFIG_SMP
> > +	this_cpu_dec(sb->s_writers);
> > +#else
> > +	preempt_disable();
> > +	sb->s_writers--;
> > +	preempt_enable();
> > +#endif
> 
> I really dislike this type of open coded per-cpu counter
> implementation. I can't see that there is no good reason to use it
> over percpu_counters here which abstract all this mess away.
> 
> i.e. it is relatively rare that the per-cpu count will nest
> greater than the percpu_counter batch size (needs more than 32
> concurrent blocked active writes per CPU), so there is no
> significant overhead to using the percpu_counters here.
> 
> Indeed, if there are that many blocked writes per CPU, then the
> overhead of an occasional global counter update is going to be lost
> in the noise of everything else that is going on.
  Well, I just did it the way mnt_want_write / mnt_put_write does it. But
you are right that it's unnecessary so it's a good idea to switch the code
to using per-cpu counters. Thanks for the idea.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/4] Fix filesystem freezing
  2012-01-13  0:09     ` Dave Chinner
@ 2012-01-13 11:07       ` Jan Kara
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2012-01-13 11:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-fsdevel, LKML, linux-ext4, xfs, Eric Sandeen,
	Dave Chinner, Surbhi Palande, Kamal Mostafa, Christoph Hellwig

On Fri 13-01-12 11:09:32, Dave Chinner wrote:
> On Thu, Jan 12, 2012 at 12:30:31PM +0100, Jan Kara wrote:
> > On Thu 12-01-12 13:48:41, Dave Chinner wrote:
> > > On Thu, Jan 12, 2012 at 02:20:49AM +0100, Jan Kara wrote:
> > > > 
> > > >   Hello,
> > > > 
> > > >   filesystem freezing is currently racy and thus we can end up with dirty data
> > > > on frozen filesystem (see changelog of the first patch for detailed race
> > > > description and proposed fix). This patch series aims at fixing this.
> > > 
> > > It only fixes the dirty data race (i.e. SB_FREEZE_WRITE). The same
> > > race conditions exist for SB_FREEZE_TRANS on XFS, and so need the
> > > same fix. That race has had one previous attempt at fixing it in
> > > XFS but that's not possible:
> > > 
> > > b2ce397 Revert "xfs: fix filesystsem freeze race in xfs_trans_alloc"
> > > 7a249cf xfs: fix filesystsem freeze race in xfs_trans_alloc
> > > 
> > > It was looking at that problem earlier today that lead to the
> > > solution Eric proposed. Essentially the method in these patches
> > > needs to replace the xfs specifc m_active_trans counter and delay
> > > during ->fs_freeze to prevent that race condition....
> >   OK, I see. I just checked ext4 to make sure and ext4 seems to get this
> > right. Looking into Christoph's original patch it shouldn't be hard to fix
> > it. Instead of:
> >         atomic_inc(&mp->m_active_trans);
> >  
> >         if (wait_for_freeze)
> >               xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> > 
> > we just need to do a bit more elaborate
> > 
> > retry:
> >         if (wait_for_freeze)
> >               xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> >         atomic_inc(&mp->m_active_trans);
> > 	if (wait_for_freeze && mp->m_super->s_frozen >= SB_FREEZE_TRANS) {
> >         	atomic_dec(&mp->m_active_trans);
> > 		goto retry;
> > 	}
> > 
> > Or does XFS support nested transactions (i.e. a thread already holding a
> > running transaction can call into xfs_trans_alloc() again)?
> > That would make things more complicated...
> 
> You're still missing the point - that this isn't an XFS specific
> problem or that the write problem is a ext4 specific problem. The
> problem is that these are freeze state transition problems -
> something that can affect every filesystem because the freeze code
> is generic.  Quite frankly, I'm not interested in having a generic
> solution for SB_FREEZE_WRITE and a custom, per filesystem solution
> for SB_FREEZE_TRANS when the solution is exactly the same.
  I understand that both state transitions are currently racy. Just ext3,
ext4, reiserfs, gfs2, or btrfs do not really care about SB_FREEZE_TRANS
transition because they all grew their own synchronization mechanisms for
that. XFS is the only filesystem I know of which really relies on this
transition. That's why I originally decided to fixup SB_FREEZE_TRANS
transition only in XFS and not in VFS. But on a second thought I tend to
agree with you that VFS should provide a way to do race-free transition to
both states so that filesystems that want to use it can use it. So I'll add
a second counter for that.
 
> > Using sb_start_write() instead of m_active_trans won't be that easy because
> > it can create A-A deadlocks (e.g. we do sb_start_write in
> > block_page_mkwrite() and then xfs_get_blocks() decides to start a
> > transaction and calls sb_start_write() again which might block if
> > filesystem freezing started in the mean time).
> 
> So, like Eric said in his first email, it's not a "write start/end"
> interface that is needed, the interface has to work with different
> freeze levels (e.g "sb_freeze_ref(sb, level)/sb_freeze_drop(sb,
> level)").  Sure, internally it would have to map to two counters and
> different level checks, but it solves the same problem for all
> levels of freeze for all filesystems.
> 
> Let's fix this freeze problem once and for all in the generic code,
> and not have to keep coming back to it to add more functioanlity for
> different situations the most recent fix didn't handle for random
> filesystem X....
  Yeah. I think ext3/4 could be converted to the generic mechanism
(although it won't be completely trivial since it uses the internal
mechanism also for other things than filesystem freezing).
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2012-01-13 11:08 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-12  1:20 [PATCH 0/4] Fix filesystem freezing Jan Kara
2012-01-12  1:20 ` [PATCH 1/4] fs: Improve filesystem freezing handling Jan Kara
2012-01-12 19:53   ` Andreas Dilger
2012-01-12 20:07     ` Jan Kara
2012-01-12 22:57   ` Eric Sandeen
2012-01-12 23:15     ` Jan Kara
2012-01-13  1:26   ` Dave Chinner
2012-01-13 10:12     ` Jan Kara
2012-01-12  1:20 ` [PATCH 2/4] vfs: Protect write paths by sb_start_write - sb_end_write Jan Kara
2012-01-12 19:56   ` Andreas Dilger
2012-01-12 20:11     ` Jan Kara
2012-01-12  1:20 ` [PATCH 3/4] ext4: Protect ext4_page_mkwrite with " Jan Kara
2012-01-12  1:20 ` [PATCH 4/4] xfs: Protect xfs_file_aio_write() " Jan Kara
2012-01-12 21:29   ` Al Viro
2012-01-12 21:36     ` Jan Kara
2012-01-12  2:48 ` [PATCH 0/4] Fix filesystem freezing Dave Chinner
2012-01-12 11:30   ` Jan Kara
2012-01-13  0:09     ` Dave Chinner
2012-01-13 11:07       ` Jan Kara
2012-01-12 20:48 ` Ted Ts'o
2012-01-12 21:38   ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).