* [PATCH 1/3] vfs: redesign sb state flags
@ 2010-02-27  0:25 Dmitry Monakhov
  2010-02-27  0:25 ` [PATCH 2/3] vfs: per-sb write count preparation stage Dmitry Monakhov
  0 siblings, 1 reply; 10+ messages in thread
From: Dmitry Monakhov @ 2010-02-27  0:25 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: viro, Dmitry Monakhov

The super block has a dedicated variable (s_need_sync) for fssync
tracking. This is not optimal. Rename it to a generic s_state bit
field; other bits will be used by later patches.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/sync.c          |    6 +++---
 include/linux/fs.h |    5 ++++-
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/sync.c b/fs/sync.c
index 418727a..6e540fc 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -97,13 +97,13 @@ static void sync_filesystems(int wait)
 	mutex_lock(&mutex);		/* Could be down_interruptible */
 	spin_lock(&sb_lock);
 	list_for_each_entry(sb, &super_blocks, s_list)
-		sb->s_need_sync = 1;
+		set_bit(ST_FS_SYNC, &sb->s_state);
 
 restart:
 	list_for_each_entry(sb, &super_blocks, s_list) {
-		if (!sb->s_need_sync)
+		if (!test_bit(ST_FS_SYNC, &sb->s_state))
 			continue;
-		sb->s_need_sync = 0;
+		clear_bit(ST_FS_SYNC, &sb->s_state);
 		sb->s_count++;
 		spin_unlock(&sb_lock);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index cd32f1b..66369f6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1306,6 +1306,9 @@ extern int send_sigurg(struct fown_struct *fown);
 #define MNT_DETACH	0x00000002	/* Just detach from the tree */
 #define MNT_EXPIRE	0x00000004	/* Mark for expiry */
 
+enum {
+	ST_FS_SYNC, 	/* fssync is about to happen */
+};
 extern struct list_head super_blocks;
 extern spinlock_t sb_lock;
 
@@ -1325,11 +1328,11 @@ struct super_block {
 	const struct export_operations *s_export_op;
 	unsigned long		s_flags;
 	unsigned long		s_magic;
+	unsigned long		s_state;
 	struct dentry		*s_root;
 	struct rw_semaphore	s_umount;
 	struct mutex		s_lock;
 	int			s_count;
-	int			s_need_sync;
 	atomic_t		s_active;
 #ifdef CONFIG_SECURITY
 	void                    *s_security;
-- 
1.6.6


* [PATCH 2/3] vfs: per-sb write count preparation stage
  2010-02-27  0:25 [PATCH 1/3] vfs: redesign sb state flags Dmitry Monakhov
@ 2010-02-27  0:25 ` Dmitry Monakhov
  2010-02-27  0:25   ` [PATCH 3/3] vfs: fix filesystem_sync vs write race on rw=>ro remount v2 Dmitry Monakhov
  0 siblings, 1 reply; 10+ messages in thread
From: Dmitry Monakhov @ 2010-02-27  0:25 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: viro, Dmitry Monakhov

The current fs_may_remount_ro() implementation is suboptimal. A sane
implementation needs some sort of per-sb dirty counter. We could
implement it exactly like the per-mount write counter, but that would
add extra code to the fast path. Let's not reinvent the wheel: instead,
calculate the sb write count as the sum of the write counts of its
vfsmounts. To achieve this, we have to link each vfsmount into a
per-sb list.

A later patch introduces the actual dirty write counter implementation.
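
To illustrate the idea, the aggregation in fs/namespace.c might look
roughly like the sketch below (hypothetical: sb_count_writers() is not
part of this series; count_mnt_writers() is the existing helper that
patch 3 also uses):

/*
 * Sketch: compute the per-sb write count as the sum of the write
 * counts of all vfsmounts linked into sb->s_vfsmount.
 */
static int sb_count_writers(struct super_block *sb)
{
	struct vfsmount *mnt;
	int count = 0;

	spin_lock(&vfsmount_lock);
	list_for_each_entry(mnt, &sb->s_vfsmount, mnt_sb_list)
		count += count_mnt_writers(mnt);
	spin_unlock(&vfsmount_lock);
	return count;
}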

NOTE: The problem with the per-sb list is that external modules may not
know about this new logic: a module may simply forget to insert its mnt
into the sb's list. This condition is hard to guard with some sort of
BUG_ON, so such an sb's list would stay empty and the write count would
give an incorrect result.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/btrfs/super.c      |    2 +-
 fs/gfs2/ops_fstype.c  |    4 ++--
 fs/namespace.c        |   23 +++++++++++++++++++++--
 fs/nfs/super.c        |   12 ++++++------
 fs/super.c            |    1 +
 include/linux/fs.h    |    3 +++
 include/linux/mount.h |    1 +
 7 files changed, 35 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 8a1ea6e..f82b0ad 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -567,7 +567,7 @@ static int btrfs_get_sb(struct file_system_type *fs_type, int flags,
 		}
 	}
 
-	mnt->mnt_sb = s;
+	super_add_mnt(s, mnt);
 	mnt->mnt_root = root;
 
 	kfree(subvol_name);
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 8a102f7..f0d71a0 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -1361,7 +1361,7 @@ static int gfs2_get_sb(struct file_system_type *fs_type, int flags,
 	}
 
 	sdp = s->s_fs_info;
-	mnt->mnt_sb = s;
+	super_add_mnt(s, mnt);
 	if (args.ar_meta)
 		mnt->mnt_root = dget(sdp->sd_master_dir);
 	else
@@ -1406,7 +1406,7 @@ static int gfs2_get_sb_meta(struct file_system_type *fs_type, int flags,
 		return -EBUSY;
 	}
 	sdp = s->s_fs_info;
-	mnt->mnt_sb = s;
+	super_add_mnt(s, mnt);
 	mnt->mnt_root = dget(sdp->sd_master_dir);
 	return 0;
 }
diff --git a/fs/namespace.c b/fs/namespace.c
index ffa3843..e816097 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -145,6 +145,7 @@ struct vfsmount *alloc_vfsmnt(const char *name)
 		INIT_LIST_HEAD(&mnt->mnt_hash);
 		INIT_LIST_HEAD(&mnt->mnt_child);
 		INIT_LIST_HEAD(&mnt->mnt_mounts);
+		INIT_LIST_HEAD(&mnt->mnt_sb_list);
 		INIT_LIST_HEAD(&mnt->mnt_list);
 		INIT_LIST_HEAD(&mnt->mnt_expire);
 		INIT_LIST_HEAD(&mnt->mnt_share);
@@ -389,9 +390,26 @@ static void __mnt_unmake_readonly(struct vfsmount *mnt)
 	spin_unlock(&vfsmount_lock);
 }
 
-void simple_set_mnt(struct vfsmount *mnt, struct super_block *sb)
+void super_add_mnt(struct super_block *sb, struct vfsmount *mnt)
 {
+	spin_lock(&vfsmount_lock);
+	list_add(&mnt->mnt_sb_list, &sb->s_vfsmount);
 	mnt->mnt_sb = sb;
+	spin_unlock(&vfsmount_lock);
+}
+EXPORT_SYMBOL(super_add_mnt);
+
+void super_del_mnt(struct vfsmount *mnt)
+{
+	spin_lock(&vfsmount_lock);
+	list_del_init(&mnt->mnt_sb_list);
+	spin_unlock(&vfsmount_lock);
+}
+EXPORT_SYMBOL(super_del_mnt);
+
+void simple_set_mnt(struct vfsmount *mnt, struct super_block *sb)
+{
+	super_add_mnt(sb, mnt);
 	mnt->mnt_root = dget(sb->s_root);
 }
 
@@ -563,7 +581,7 @@ static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root,
 
 		mnt->mnt_flags = old->mnt_flags;
 		atomic_inc(&sb->s_active);
-		mnt->mnt_sb = sb;
+		super_add_mnt(sb, mnt);
 		mnt->mnt_root = dget(root);
 		mnt->mnt_mountpoint = mnt->mnt_root;
 		mnt->mnt_parent = mnt;
@@ -997,6 +1015,7 @@ void release_mounts(struct list_head *head)
 	while (!list_empty(head)) {
 		mnt = list_first_entry(head, struct vfsmount, mnt_hash);
 		list_del_init(&mnt->mnt_hash);
+		super_del_mnt(mnt);
 		if (mnt->mnt_parent != mnt) {
 			struct dentry *dentry;
 			struct vfsmount *m;
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index f1afee4..3470201 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -2235,7 +2235,7 @@ static int nfs_get_sb(struct file_system_type *fs_type,
 		goto error_splat_root;
 
 	s->s_flags |= MS_ACTIVE;
-	mnt->mnt_sb = s;
+	super_add_mnt(s, mnt);
 	mnt->mnt_root = mntroot;
 	error = 0;
 
@@ -2347,7 +2347,7 @@ static int nfs_xdev_get_sb(struct file_system_type *fs_type, int flags,
 	}
 
 	s->s_flags |= MS_ACTIVE;
-	mnt->mnt_sb = s;
+	super_add_mnt(s, mnt);
 	mnt->mnt_root = mntroot;
 
 	/* clone any lsm security options from the parent to the new sb */
@@ -2599,7 +2599,7 @@ static int nfs4_remote_get_sb(struct file_system_type *fs_type,
 		goto error_splat_root;
 
 	s->s_flags |= MS_ACTIVE;
-	mnt->mnt_sb = s;
+	super_add_mnt(s, mnt);
 	mnt->mnt_root = mntroot;
 	error = 0;
 
@@ -2681,7 +2681,7 @@ static int nfs_follow_remote_path(struct vfsmount *root_mnt,
 
 	s = nd.path.mnt->mnt_sb;
 	atomic_inc(&s->s_active);
-	mnt_target->mnt_sb = s;
+	super_add_mnt(s, mnt_target);
 	mnt_target->mnt_root = dget(nd.path.dentry);
 
 	/* Correct the device pathname */
@@ -2832,7 +2832,7 @@ static int nfs4_xdev_get_sb(struct file_system_type *fs_type, int flags,
 	}
 
 	s->s_flags |= MS_ACTIVE;
-	mnt->mnt_sb = s;
+	super_add_mnt(s, mnt);
 	mnt->mnt_root = mntroot;
 
 	security_sb_clone_mnt_opts(data->sb, s);
@@ -2914,7 +2914,7 @@ static int nfs4_remote_referral_get_sb(struct file_system_type *fs_type,
 	}
 
 	s->s_flags |= MS_ACTIVE;
-	mnt->mnt_sb = s;
+	super_add_mnt(s, mnt);
 	mnt->mnt_root = mntroot;
 
 	security_sb_clone_mnt_opts(data->sb, s);
diff --git a/fs/super.c b/fs/super.c
index f35ac60..d3e0083 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -151,6 +151,7 @@ int __put_super_and_need_restart(struct super_block *sb)
 {
 	/* check for race with generic_shutdown_super() */
 	if (list_empty(&sb->s_list)) {
+		WARN_ON(!list_empty(&sb->s_vfsmount));
 		/* super block is removed, need to restart... */
 		__put_super(sb);
 		return 1;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 66369f6..75f057d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1339,6 +1339,7 @@ struct super_block {
 #endif
 	struct xattr_handler	**s_xattr;
 
+	struct list_head	s_vfsmount;	/* All vfsmounts */
 	struct list_head	s_inodes;	/* all inodes */
 	struct hlist_head	s_anon;		/* anonymous dentries for (nfs) exporting */
 	struct list_head	s_files;
@@ -1779,6 +1780,8 @@ extern int get_sb_pseudo(struct file_system_type *, char *,
 	const struct super_operations *ops, unsigned long,
 	struct vfsmount *mnt);
 extern void simple_set_mnt(struct vfsmount *mnt, struct super_block *sb);
+extern void super_add_mnt(struct super_block *sb, struct vfsmount *mnt);
+extern void super_del_mnt(struct vfsmount *mnt);
 int __put_super_and_need_restart(struct super_block *sb);
 void put_super(struct super_block *sb);
 
diff --git a/include/linux/mount.h b/include/linux/mount.h
index ca726eb..751784c 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -59,6 +59,7 @@ struct vfsmount {
 	/* 4 bytes hole on 64bits arches */
 	const char *mnt_devname;	/* Name of device e.g. /dev/dsk/hda1 */
 	struct list_head mnt_list;
+	struct list_head mnt_sb_list;	/* link in per-sb mount list */
 	struct list_head mnt_expire;	/* link in fs-specific expiry list */
 	struct list_head mnt_share;	/* circular list of shared mounts */
 	struct list_head mnt_slave_list;/* list of slave mounts */
-- 
1.6.6


* [PATCH 3/3]      vfs: fix filesystem_sync vs write race on rw=>ro remount v2
  2010-02-27  0:25 ` [PATCH 2/3] vfs: per-sb write count preparation stage Dmitry Monakhov
@ 2010-02-27  0:25   ` Dmitry Monakhov
  2010-03-02 13:29     ` Jan Kara
  2010-03-02 15:21     ` Nick Piggin
  0 siblings, 2 replies; 10+ messages in thread
From: Dmitry Monakhov @ 2010-02-27  0:25 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: viro, Dmitry Monakhov

Currently, on an rw=>ro remount, we have the following race:
| mount /mnt -oremount,ro | write-task |
|-------------------------+------------|
|                         | open(RDWR) |
| shrink_dcache_sb(sb);   |            |
| sync_filesystem(sb);    |            |
|                         | write()    |
|                         | close()    |
| fs_may_remount_ro(sb)   |            |
| sb->s_flags = new_flags |            |

Later writeback or sync() will result in an error due to the MS_RDONLY
flag. In the case of ext4 this results in a jbd2_start failure on writeback:
LOG: ext4_da_writepages: jbd2_start: 1024 pages, ino 1431; err -30

This patch fixes the issue as follows:
1) Introduce an ST_REMOUNT_RO bit which is set when the remount starts.
   The bit is cleared if new writers appear.
2) Redesign fs_may_remount_ro(). It now simply sums the write counts of
   the corresponding vfsmounts.
3) The rest is simple: perform the sync, then check that no new writers
   appeared. The intended interleaving is sketched below.
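
Illustrative outline of the intended interleaving (reconstructed from
the steps above; the write-task column shows a writer racing with the
remount):

| mount /mnt -oremount,ro            | write-task                  |
|------------------------------------+-----------------------------|
| set_bit(ST_REMOUNT_RO)             |                             |
| smp_mb()                           |                             |
| fs_may_remount_ro() /* check #1 */ |                             |
| sync_filesystem(sb)                | mnt_want_write():           |
|                                    |   inc write count           |
|                                    |   smp_mb()                  |
|                                    |   clear_bit(ST_REMOUNT_RO)  |
| fs_may_remount_ro() /* check #2 */ |                             |
| => bit cleared, return -EBUSY      |                             |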

##TESTCASE_BEGIN:
#! /bin/bash -x
DEV=/dev/sdb5
FSTYPE=ext4
BINDIR=/home/dmon
MNTOPT="data=ordered"
umount /mnt
mkfs.${FSTYPE}  ${DEV} || exit 1
mount  ${DEV} /mnt -o${MNTOPT} || exit 1
${BINDIR}/fsstress -p1 -l999999999 -n9999999999 -d /mnt/test &
sleep 15
mount /mnt -oremount,ro,${MNTOPT}
sleep 1
killall -9 fsstress
sync
# after this you may get the following message in dmesg
# "ext4_da_writepages: jbd2_start: 1024 pages, ino 1431; err -30"
##TESTCASE_END

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/file_table.c    |   24 ------------------------
 fs/namespace.c     |   48 ++++++++++++++++++++++++++++++++++++++++++++----
 fs/super.c         |   33 +++++++++++++++++++++++++++------
 include/linux/fs.h |    1 +
 4 files changed, 72 insertions(+), 34 deletions(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index b98404b..32d37af 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -348,30 +348,6 @@ void file_kill(struct file *file)
 	}
 }
 
-int fs_may_remount_ro(struct super_block *sb)
-{
-	struct file *file;
-
-	/* Check that no files are currently opened for writing. */
-	file_list_lock();
-	list_for_each_entry(file, &sb->s_files, f_u.fu_list) {
-		struct inode *inode = file->f_path.dentry->d_inode;
-
-		/* File with pending delete? */
-		if (inode->i_nlink == 0)
-			goto too_bad;
-
-		/* Writeable file? */
-		if (S_ISREG(inode->i_mode) && (file->f_mode & FMODE_WRITE))
-			goto too_bad;
-	}
-	file_list_unlock();
-	return 1; /* Tis' cool bro. */
-too_bad:
-	file_list_unlock();
-	return 0;
-}
-
 /**
  *	mark_files_ro - mark all files read-only
  *	@sb: superblock in question
diff --git a/fs/namespace.c b/fs/namespace.c
index e816097..bc79f1f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -277,6 +277,13 @@ int mnt_want_write(struct vfsmount *mnt)
 		dec_mnt_writers(mnt);
 		ret = -EROFS;
 		goto out;
+	} else {
+		/*
+		 * Clear ST_REMOUNT_RO flag to let remount task know
+		 * about new writers.
+		 */
+		if (unlikely(test_bit(ST_REMOUNT_RO, &mnt->mnt_sb->s_state)))
+			clear_bit(ST_REMOUNT_RO, &mnt->mnt_sb->s_state);
 	}
 out:
 	preempt_enable();
@@ -341,11 +348,10 @@ void mnt_drop_write(struct vfsmount *mnt)
 }
 EXPORT_SYMBOL_GPL(mnt_drop_write);
 
-static int mnt_make_readonly(struct vfsmount *mnt)
+static int mnt_check_writers(struct vfsmount *mnt, int set_ro)
 {
 	int ret = 0;
 
-	spin_lock(&vfsmount_lock);
 	mnt->mnt_flags |= MNT_WRITE_HOLD;
 	/*
 	 * After storing MNT_WRITE_HOLD, we'll read the counters. This store
@@ -371,14 +377,23 @@ static int mnt_make_readonly(struct vfsmount *mnt)
 	 */
 	if (count_mnt_writers(mnt) > 0)
 		ret = -EBUSY;
-	else
-		mnt->mnt_flags |= MNT_READONLY;
+	else {
+		if (likely(set_ro))
+			mnt->mnt_flags |= MNT_READONLY;
+	}
 	/*
 	 * MNT_READONLY must become visible before ~MNT_WRITE_HOLD, so writers
 	 * that become unheld will see MNT_READONLY.
 	 */
 	smp_wmb();
 	mnt->mnt_flags &= ~MNT_WRITE_HOLD;
+	return ret;
+}
+static int mnt_make_readonly(struct vfsmount *mnt)
+{
+	int ret;
+	spin_lock(&vfsmount_lock);
+	ret = mnt_check_writers(mnt, 1);
 	spin_unlock(&vfsmount_lock);
 	return ret;
 }
@@ -389,6 +404,31 @@ static void __mnt_unmake_readonly(struct vfsmount *mnt)
 	mnt->mnt_flags &= ~MNT_READONLY;
 	spin_unlock(&vfsmount_lock);
 }
+/**
+ * Check whether it is possible to remount the given sb read-only.
+ * @sb: super block in question
+ *
+ * The caller is responsible for setting ST_REMOUNT_RO before the call.
+ */
+int fs_may_remount_ro(struct super_block *sb)
+{
+	struct vfsmount *mnt;
+	int ret = 1;
+	spin_lock(&vfsmount_lock);
+	list_for_each_entry(mnt, &sb->s_vfsmount, mnt_sb_list) {
+		ret = !mnt_check_writers(mnt, 0);
+		if (!ret)
+			break;
+	}
+	spin_unlock(&vfsmount_lock);
+	/*
+	 * If a new writer appeared while we were checking the vfsmounts,
+	 * the ST_REMOUNT_RO bit will have been cleared.
+	 */
+	if (!test_bit(ST_REMOUNT_RO, &sb->s_state))
+		ret = 0;
+	return ret;
+}
 
 void super_add_mnt(struct super_block *sb, struct vfsmount *mnt)
 {
diff --git a/fs/super.c b/fs/super.c
index d3e0083..8cf148b 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -571,6 +571,9 @@ int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
 	int retval;
 	int remount_rw, remount_ro;
 
+	remount_ro = (flags & MS_RDONLY) && !(sb->s_flags & MS_RDONLY);
+	remount_rw = !(flags & MS_RDONLY) && (sb->s_flags & MS_RDONLY);
+
 	if (sb->s_frozen != SB_UNFROZEN)
 		return -EBUSY;
 
@@ -581,18 +584,33 @@ int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
 
 	if (flags & MS_RDONLY)
 		acct_auto_close(sb);
+	if (remount_ro) {
+		if (force)
+			mark_files_ro(sb);
+		/*
+		 * Set the ST_REMOUNT_RO flag. New writers (if any) will clear
+		 * this bit.
+		 */
+		set_bit(ST_REMOUNT_RO, &sb->s_state);
+		/*
+		 * This store must be visible before the write-count checks below.
+		 */
+		smp_mb();
+		/*
+		 * To prevent sync vs write race condition we have to check
+		 * writers before filesystem sync.
+		 */
+		if (!fs_may_remount_ro(sb))
+			return -EBUSY;
+	}
 	shrink_dcache_sb(sb);
 	sync_filesystem(sb);
 
-	remount_ro = (flags & MS_RDONLY) && !(sb->s_flags & MS_RDONLY);
-	remount_rw = !(flags & MS_RDONLY) && (sb->s_flags & MS_RDONLY);
 
 	/* If we are remounting RDONLY and current sb is read/write,
-	   make sure there are no rw files opened */
+	   make sure there are no new rw files opened */
 	if (remount_ro) {
-		if (force)
-			mark_files_ro(sb);
-		else if (!fs_may_remount_ro(sb))
+		if (!fs_may_remount_ro(sb))
 			return -EBUSY;
 		retval = vfs_dq_off(sb, 1);
 		if (retval < 0 && retval != -ENOSYS)
@@ -617,6 +635,9 @@ int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
 	 */
 	if (remount_ro && sb->s_bdev)
 		invalidate_bdev(sb->s_bdev);
+
+	WARN_ON(remount_ro && !fs_may_remount_ro(sb));
+	clear_bit(ST_REMOUNT_RO, &sb->s_state);
 	return 0;
 }
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 75f057d..0ad7aa4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1308,6 +1308,7 @@ extern int send_sigurg(struct fown_struct *fown);
 
 enum {
 	ST_FS_SYNC, 	/* fssync is about to happen */
+	ST_REMOUNT_RO,	/* readonly remount is in progress */
 };
 extern struct list_head super_blocks;
 extern spinlock_t sb_lock;
-- 
1.6.6


* Re: [PATCH 3/3]      vfs: fix filesystem_sync vs write race on rw=>ro remount v2
  2010-02-27  0:25   ` [PATCH 3/3] vfs: fix filesystem_sync vs write race on rw=>ro remount v2 Dmitry Monakhov
@ 2010-03-02 13:29     ` Jan Kara
  2010-03-02 14:04       ` Dmitry Monakhov
  2010-03-02 15:21     ` Nick Piggin
  1 sibling, 1 reply; 10+ messages in thread
From: Jan Kara @ 2010-03-02 13:29 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-fsdevel, viro

On Sat 27-02-10 03:25:49, Dmitry Monakhov wrote:
> +/**
> + * Check whenever is it possible to remount given sb to readonly.
> + * @sb : super block in question
> + *
> + * Caller is responsible to set ST_REMOUNT_RO state before the call.
> + */
> +int fs_may_remount_ro(struct super_block *sb)
> +{
> +	struct vfsmount *mnt;
> +	int ret = 1;
> +	spin_lock(&vfsmount_lock);
> +	list_for_each_entry(mnt, &sb->s_vfsmount, mnt_sb_list) {
> +		ret = !mnt_check_writers(mnt, 0);
> +		if (ret)
> +			break;
> +	}
> +	spin_unlock(&vfsmount_lock);
> +	/*
> +	 * If new writer appears after we have checked all vfsmounts.
> +	 * Then ST_REMOUNT_RO bit will be cleared.
> +	 */
> +	if (!test_bit(ST_REMOUNT_RO, &sb->s_state))
> +		ret = 0;
> +	return ret;
> +}
  This misses the case where the superblock has unlinked-but-open files.
In that case we have to fail the remount RO as well; the original
fs_may_remount_ro() checks for that.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: [PATCH 3/3]      vfs: fix filesystem_sync vs write race on rw=>ro remount v2
  2010-03-02 13:29     ` Jan Kara
@ 2010-03-02 14:04       ` Dmitry Monakhov
  2010-03-02 14:06         ` Jan Kara
  0 siblings, 1 reply; 10+ messages in thread
From: Dmitry Monakhov @ 2010-03-02 14:04 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, viro

Jan Kara <jack@suse.cz> writes:

> On Sat 27-02-10 03:25:49, Dmitry Monakhov wrote:
>> +/**
>> + * Check whenever is it possible to remount given sb to readonly.
>> + * @sb : super block in question
>> + *
>> + * Caller is responsible to set ST_REMOUNT_RO state before the call.
>> + */
>> +int fs_may_remount_ro(struct super_block *sb)
>> +{
>> +	struct vfsmount *mnt;
>> +	int ret = 1;
>> +	spin_lock(&vfsmount_lock);
>> +	list_for_each_entry(mnt, &sb->s_vfsmount, mnt_sb_list) {
>> +		ret = !mnt_check_writers(mnt, 0);
>> +		if (ret)
>> +			break;
>> +	}
>> +	spin_unlock(&vfsmount_lock);
>> +	/*
>> +	 * If new writer appears after we have checked all vfsmounts.
>> +	 * Then ST_REMOUNT_RO bit will be cleared.
>> +	 */
>> +	if (!test_bit(ST_REMOUNT_RO, &sb->s_state))
>> +		ret = 0;
>> +	return ret;
>> +}
>   This misses the case when the superblock as unlinked-but-open files.
> In such case we have to fail remount RO as well. The original
> fs_may_remount_ro checks for that..
Since the file is opened for write, one of the vfsmount structs would have
a non-zero write count, so -EBUSY will be returned from fs_may_remount_ro().
>
> 								Honza

* Re: [PATCH 3/3]      vfs: fix filesystem_sync vs write race on rw=>ro remount v2
  2010-03-02 14:04       ` Dmitry Monakhov
@ 2010-03-02 14:06         ` Jan Kara
  2010-03-02 14:24           ` Dmitry Monakhov
  0 siblings, 1 reply; 10+ messages in thread
From: Jan Kara @ 2010-03-02 14:06 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Jan Kara, linux-fsdevel, viro

On Tue 02-03-10 17:04:26, Dmitry Monakhov wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> > On Sat 27-02-10 03:25:49, Dmitry Monakhov wrote:
> >> +/**
> >> + * Check whenever is it possible to remount given sb to readonly.
> >> + * @sb : super block in question
> >> + *
> >> + * Caller is responsible to set ST_REMOUNT_RO state before the call.
> >> + */
> >> +int fs_may_remount_ro(struct super_block *sb)
> >> +{
> >> +	struct vfsmount *mnt;
> >> +	int ret = 1;
> >> +	spin_lock(&vfsmount_lock);
> >> +	list_for_each_entry(mnt, &sb->s_vfsmount, mnt_sb_list) {
> >> +		ret = !mnt_check_writers(mnt, 0);
> >> +		if (ret)
> >> +			break;
> >> +	}
> >> +	spin_unlock(&vfsmount_lock);
> >> +	/*
> >> +	 * If new writer appears after we have checked all vfsmounts.
> >> +	 * Then ST_REMOUNT_RO bit will be cleared.
> >> +	 */
> >> +	if (!test_bit(ST_REMOUNT_RO, &sb->s_state))
> >> +		ret = 0;
> >> +	return ret;
> >> +}
> >   This misses the case when the superblock as unlinked-but-open files.
> > In such case we have to fail remount RO as well. The original
> > fs_may_remount_ro checks for that..
> Since file is opened for write one of vfsmnt struct would have non
> zero write count. So -EBUSY will be returned from fs_may_remount_ro()
  But the file can be open for reading only...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: [PATCH 3/3]      vfs: fix filesystem_sync vs write race on rw=>ro remount v2
  2010-03-02 14:06         ` Jan Kara
@ 2010-03-02 14:24           ` Dmitry Monakhov
  2010-03-02 14:35             ` Jan Kara
  0 siblings, 1 reply; 10+ messages in thread
From: Dmitry Monakhov @ 2010-03-02 14:24 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, viro

Jan Kara <jack@suse.cz> writes:

> On Tue 02-03-10 17:04:26, Dmitry Monakhov wrote:
>> Jan Kara <jack@suse.cz> writes:
>> 
>> > On Sat 27-02-10 03:25:49, Dmitry Monakhov wrote:
>> >> +/**
>> >> + * Check whenever is it possible to remount given sb to readonly.
>> >> + * @sb : super block in question
>> >> + *
>> >> + * Caller is responsible to set ST_REMOUNT_RO state before the call.
>> >> + */
>> >> +int fs_may_remount_ro(struct super_block *sb)
>> >> +{
>> >> +	struct vfsmount *mnt;
>> >> +	int ret = 1;
>> >> +	spin_lock(&vfsmount_lock);
>> >> +	list_for_each_entry(mnt, &sb->s_vfsmount, mnt_sb_list) {
>> >> +		ret = !mnt_check_writers(mnt, 0);
>> >> +		if (ret)
>> >> +			break;
>> >> +	}
>> >> +	spin_unlock(&vfsmount_lock);
>> >> +	/*
>> >> +	 * If new writer appears after we have checked all vfsmounts.
>> >> +	 * Then ST_REMOUNT_RO bit will be cleared.
>> >> +	 */
>> >> +	if (!test_bit(ST_REMOUNT_RO, &sb->s_state))
>> >> +		ret = 0;
>> >> +	return ret;
>> >> +}
>> >   This misses the case when the superblock as unlinked-but-open files.
>> > In such case we have to fail remount RO as well. The original
>> > fs_may_remount_ro checks for that..
>> Since file is opened for write one of vfsmnt struct would have non
>> zero write count. So -EBUSY will be returned from fs_may_remount_ro()
>   But file can be open for reading only...
Ohh.. I see. What is the reason to fail an RO remount due to unlinked
files? Is it because not all filesystems have an orphan list?
Shame on such filesystems. It seems I also have to grab the write count
in the unlink path when i_nlink becomes zero, and drop it on inode release.

* Re: [PATCH 3/3]      vfs: fix filesystem_sync vs write race on rw=>ro remount v2
  2010-03-02 14:24           ` Dmitry Monakhov
@ 2010-03-02 14:35             ` Jan Kara
  2010-03-02 15:38               ` Dmitry Monakhov
  0 siblings, 1 reply; 10+ messages in thread
From: Jan Kara @ 2010-03-02 14:35 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Jan Kara, linux-fsdevel, viro

On Tue 02-03-10 17:24:36, Dmitry Monakhov wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> > On Tue 02-03-10 17:04:26, Dmitry Monakhov wrote:
> >> Jan Kara <jack@suse.cz> writes:
> >> 
> >> > On Sat 27-02-10 03:25:49, Dmitry Monakhov wrote:
> >> >> +/**
> >> >> + * Check whenever is it possible to remount given sb to readonly.
> >> >> + * @sb : super block in question
> >> >> + *
> >> >> + * Caller is responsible to set ST_REMOUNT_RO state before the call.
> >> >> + */
> >> >> +int fs_may_remount_ro(struct super_block *sb)
> >> >> +{
> >> >> +	struct vfsmount *mnt;
> >> >> +	int ret = 1;
> >> >> +	spin_lock(&vfsmount_lock);
> >> >> +	list_for_each_entry(mnt, &sb->s_vfsmount, mnt_sb_list) {
> >> >> +		ret = !mnt_check_writers(mnt, 0);
> >> >> +		if (ret)
> >> >> +			break;
> >> >> +	}
> >> >> +	spin_unlock(&vfsmount_lock);
> >> >> +	/*
> >> >> +	 * If new writer appears after we have checked all vfsmounts.
> >> >> +	 * Then ST_REMOUNT_RO bit will be cleared.
> >> >> +	 */
> >> >> +	if (!test_bit(ST_REMOUNT_RO, &sb->s_state))
> >> >> +		ret = 0;
> >> >> +	return ret;
> >> >> +}
> >> >   This misses the case when the superblock as unlinked-but-open files.
> >> > In such case we have to fail remount RO as well. The original
> >> > fs_may_remount_ro checks for that..
> >> Since file is opened for write one of vfsmnt struct would have non
> >> zero write count. So -EBUSY will be returned from fs_may_remount_ro()
> >   But file can be open for reading only...
> Ohh.. i see. what is the reason to fail RO remount due to unlinked
> files? Is this because not all filesystem has orphan list?
  Yes, this is the main reason.

> Shame on such FS. Seems that i have to also grab write count
> in unlink path if i_nlink becomes zero and drop on inode release.
  Yes, I'd imagine some solution like this as well.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: [PATCH 3/3]      vfs: fix filesystem_sync vs write race on rw=>ro remount v2
  2010-02-27  0:25   ` [PATCH 3/3] vfs: fix filesystem_sync vs write race on rw=>ro remount v2 Dmitry Monakhov
  2010-03-02 13:29     ` Jan Kara
@ 2010-03-02 15:21     ` Nick Piggin
  1 sibling, 0 replies; 10+ messages in thread
From: Nick Piggin @ 2010-03-02 15:21 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-fsdevel, viro

On Sat, Feb 27, 2010 at 03:25:49AM +0300, Dmitry Monakhov wrote:
> diff --git a/fs/namespace.c b/fs/namespace.c
> index e816097..bc79f1f 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -277,6 +277,13 @@ int mnt_want_write(struct vfsmount *mnt)
>  		dec_mnt_writers(mnt);
>  		ret = -EROFS;
>  		goto out;
> +	} else {
> +		/*
> +		 * Clear ST_REMOUNT_RO flag to let remount task know
> +		 * about new writers.
> +		 */
> +		if (unlikely(test_bit(ST_REMOUNT_RO, &mnt->mnt_sb->s_state)))
> +			clear_bit(ST_REMOUNT_RO, &mnt->mnt_sb->s_state);
>  	}
>  out:
>  	preempt_enable();

I'm not totally sure you've covered the race here. The other side:

store s_state <- ST_REMOUNT_RO
smp_mb()
load  write count
spin_unlock()
load  s_state

So to start with, the load of s_state could be moved up above one
of the loads of write_count.

This side:
store  write count
smp_mb()
load   s_state
store  s_state

The memory orderings themselves should be OK, although you haven't
updated the existing smp_mb() comments to say that it now also orders
s_state manipulation with write count manipulation and hence also
couples with the new smp_mb you introduced[*].

However if you do this store/store versus load/load sequence for
synchronization (or load/store vs store/load), then you usually need to
do them in opposite order otherwise you lose synchronization: couldn't
write count be loaded in (1) before it is incremented in (2), then
s_state loaded in (1) before it is cleared in (2)?
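
Spelled out, the interleaving in question would be something like this
(an illustrative reconstruction of the description above):

    (1) remount side                 (2) mnt_want_write side
    load  write count -> 0
                                     inc   write count
                                     smp_mb()
                                     load  s_state -> bit set
    load  s_state -> bit still set
                                     clear_bit(ST_REMOUNT_RO) /* too late */

Here (1) sees no writers and the bit still set, so it wrongly concludes
the remount is safe while (2) holds write access.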

[*] please explicitly comment all barriers (including acquire and release
barriers if you use them for something other than just the critical
sections). For *every* barrier site, comment or at least point to
comments that document all actors in the protocol.


> @@ -581,18 +584,33 @@ int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
>  
>  	if (flags & MS_RDONLY)
>  		acct_auto_close(sb);
> +	if (remount_ro) {
> +		if (force)
> +			mark_files_ro(sb);

It's a pity we still need the files list for mark_files_ro.


> +		/*
> +		 * Store ST_REMOUNT_RO flag. New writers (in any) will clrear
> +		 * this bit.
> +		 */
> +		set_bit(ST_REMOUNT_RO, &sb->s_state);
> +		/*
> +		 * This store should be visible before we do.
> +		 */
> +		smp_mb();
> +		/*
> +		 * To prevent sync vs write race condition we have to check
> +		 * writers before filesystem sync.
> +		 */
> +		if (!fs_may_remount_ro(sb))
> +			return -EBUSY;

I guess in the 'force' case, you may actually want to try harder than
this.

I don't know how you can reasonably do that and prevent livelock,
though. Apart from doing this totally differently and putting a sleeping
lock into mnt_want_write, taken only in the slowpath when a remount
action is pending. This should make things simpler here but no idea
whether mnt_want_write is called from atomic context. 
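
For concreteness, that slowpath might look roughly like the fragment
below (purely hypothetical: s_remount_lock is an invented field, and
this only works if mnt_want_write() is never called from atomic
context):

	/* in mnt_want_write(), after the existing fast path: */
	if (unlikely(test_bit(ST_REMOUNT_RO, &mnt->mnt_sb->s_state))) {
		/*
		 * A remount is pending: block until it has finished
		 * instead of racing with it.  May sleep.
		 */
		mutex_lock(&mnt->mnt_sb->s_remount_lock);
		mutex_unlock(&mnt->mnt_sb->s_remount_lock);
	}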

Overall I don't know what to think of the direction you're taking here.
I guess it's probably viable, but OTOH if we were able to block in
mnt_want_write then we should be able to just eliminate all races
and make it work still by just traversing the files list. (I would
like it much more if mark_files_ro could go away at the same time,
but I don't see how).


* Re: [PATCH 3/3]      vfs: fix filesystem_sync vs write race on rw=>ro remount v2
  2010-03-02 14:35             ` Jan Kara
@ 2010-03-02 15:38               ` Dmitry Monakhov
  0 siblings, 0 replies; 10+ messages in thread
From: Dmitry Monakhov @ 2010-03-02 15:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, viro

Jan Kara <jack@suse.cz> writes:

> On Tue 02-03-10 17:24:36, Dmitry Monakhov wrote:
>> Jan Kara <jack@suse.cz> writes:
>> 
>> > On Tue 02-03-10 17:04:26, Dmitry Monakhov wrote:
>> >> Jan Kara <jack@suse.cz> writes:
>> >> 
>> >> > On Sat 27-02-10 03:25:49, Dmitry Monakhov wrote:
>> >> >> +/**
>> >> >> + * Check whenever is it possible to remount given sb to readonly.
>> >> >> + * @sb : super block in question
>> >> >> + *
>> >> >> + * Caller is responsible to set ST_REMOUNT_RO state before the call.
>> >> >> + */
>> >> >> +int fs_may_remount_ro(struct super_block *sb)
>> >> >> +{
>> >> >> +	struct vfsmount *mnt;
>> >> >> +	int ret = 1;
>> >> >> +	spin_lock(&vfsmount_lock);
>> >> >> +	list_for_each_entry(mnt, &sb->s_vfsmount, mnt_sb_list) {
>> >> >> +		ret = !mnt_check_writers(mnt, 0);
>> >> >> +		if (ret)
>> >> >> +			break;
>> >> >> +	}
>> >> >> +	spin_unlock(&vfsmount_lock);
>> >> >> +	/*
>> >> >> +	 * If new writer appears after we have checked all vfsmounts.
>> >> >> +	 * Then ST_REMOUNT_RO bit will be cleared.
>> >> >> +	 */
>> >> >> +	if (!test_bit(ST_REMOUNT_RO, &sb->s_state))
>> >> >> +		ret = 0;
>> >> >> +	return ret;
>> >> >> +}
>> >> >   This misses the case when the superblock as unlinked-but-open files.
>> >> > In such case we have to fail remount RO as well. The original
>> >> > fs_may_remount_ro checks for that..
>> >> Since file is opened for write one of vfsmnt struct would have non
>> >> zero write count. So -EBUSY will be returned from fs_may_remount_ro()
>> >   But file can be open for reading only...
>> Ohh.. i see. what is the reason to fail RO remount due to unlinked
>> files? Is this because not all filesystem has orphan list?
>   Yes, this is the main reason.
>
>> Shame on such FS. Seems that i have to also grab write count
>> in unlink path if i_nlink becomes zero and drop on inode release.
>   Yes, I'd imagine some solution like this as well.
I've taken a look at the code. It appears not as easy as it looked at
first glance. It is not obvious how to store the mnt on which the write
count was incremented during unlink. Also, there is no dedicated mnt
accessible from the sb: we have s_root, but we don't know to which mnt
that dentry belongs. So the only option is to introduce a real
super_block counter and use it to track unlinked inodes.
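
A rough sketch of such a counter (purely hypothetical, names invented):

/* new field in struct super_block: */
	atomic_t	s_unlinked;	/* unlinked-but-open inodes */

/* in the unlink path, once i_nlink drops to zero: */
	atomic_inc(&inode->i_sb->s_unlinked);

/* on final release of an inode with i_nlink == 0: */
	atomic_dec(&inode->i_sb->s_unlinked);

/* and fs_may_remount_ro() would additionally check: */
	if (atomic_read(&sb->s_unlinked))
		ret = 0;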
