All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RESEND v4 0/8] Fix freeze/sysfs deadlock in better method.
@ 2015-01-29  2:24 Qu Wenruo
  2015-01-29  2:24 ` [PATCH RESEND v4 1/8] Revert "btrfs: add support for processing pending changes" related commits Qu Wenruo
                   ` (7 more replies)
  0 siblings, 8 replies; 40+ messages in thread
From: Qu Wenruo @ 2015-01-29  2:24 UTC (permalink / raw)
  To: linux-btrfs

Patchset to solve the previous found freeze/sysfs deadlock.

Unlike previous pending_changes, which uses transaction commits to
ensure mount option doesn't change during transaction.
This idea is great in concept, but will introduce extra and somewhat
duplicated ro/freeze check, and the original synchronized behavior is
also changed.

[Per-transaction mount option]
This patch use the RCU-like concept, which will copy the mount_opt from
fs_info into btrfs_transaction, and each btrfs_test_opt() for specific
mount option bit(SPACE_CACHE/INODE_MAP_CACHE) should be changed to
btrfs_test_trans_opt().
So mount option during transaction won't be changed, and also, no extra
read_only or freeze check is needed, the commit routine is also
untouched.

[Use VFS protect for sysfs change]
The 6th patch will introduce a new help function get_vfsmount_sb() to
get a vfsmount from a sb, with this vfsmount, we can use vfs level
mnt_want_write() to do a comprehensive protection other than the cheap
ro/freeze check.

Changelog:
v1:
   Only use cheap freeze check to avoid deadlock.
v2:
   Fix the never changed pending_changes bug and handle transaction in
   btrfs_freeze()
v3:
   Add atomic mount option change and per-trans mount option.
v4:
   Add mnt_want_write() in sysfs handler.

Qu Wenruo (8):
  Revert "btrfs: add support for processing pending changes" related    
    commits
  btrfs: Make btrfs_parse_options() parse mount option in a atomic way
  btrfs: Introduce per-transaction mount_opt to keep mount option    
    consistent during transaction.
  btrfs: Use btrfs_test_trans_opt() to handle SPACE_CACHE if it's under 
       transaction protect.
  btrfs: Use btrfs_test_trans_opt() to handle INODE_CACHE if it's under 
       transaction protect.
  vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
  btrfs: Use mnt_want_write() to protect label change.
  btrfs: Use mnt_want_write() to protect sysfs feature change.

 fs/btrfs/ctree.h            |  64 +++-------------------
 fs/btrfs/disk-io.c          |   6 ---
 fs/btrfs/extent-tree.c      |   2 +-
 fs/btrfs/free-space-cache.c |   2 +-
 fs/btrfs/inode-map.c        |   5 +-
 fs/btrfs/super.c            | 129 ++++++++++++++++++++++----------------------
 fs/btrfs/sysfs.c            |  73 ++++++++++++++++++-------
 fs/btrfs/transaction.c      |  43 ++-------------
 fs/btrfs/transaction.h      |   6 ++-
 fs/namespace.c              |  25 +++++++++
 include/linux/mount.h       |   1 +
 11 files changed, 167 insertions(+), 189 deletions(-)

-- 
2.2.2


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH RESEND v4 1/8] Revert "btrfs: add support for processing pending changes" related commits
  2015-01-29  2:24 [PATCH RESEND v4 0/8] Fix freeze/sysfs deadlock in better method Qu Wenruo
@ 2015-01-29  2:24 ` Qu Wenruo
  2015-01-29  2:24 ` [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount option in a atomic way Qu Wenruo
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2015-01-29  2:24 UTC (permalink / raw)
  To: linux-btrfs

This reverts commit 572d9ab7845 ~ a6f69dc8018.

This pending commits patches introduce deadlock with freeze, and fix for
it will introduce extra checks on freeze and read only case.

For mount option change, later patches will introduce per-transaction
mount option to keep mount options consistent during transaction.

For sysfs interface to change label/features, it will use
mnt_want_write() to do the ro/freeze check and keep the 'btrfs pro set'
synchronized behavior.

Revert them to a clean base for later changes.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>

Conflicts:
	fs/btrfs/ctree.h
	fs/btrfs/super.c
---
 fs/btrfs/ctree.h       | 49 +------------------------------------------------
 fs/btrfs/disk-io.c     |  8 +++-----
 fs/btrfs/inode-map.c   |  2 +-
 fs/btrfs/super.c       | 14 +++-----------
 fs/btrfs/sysfs.c       | 34 +++++++++++++++++++++-------------
 fs/btrfs/transaction.c | 38 ++++++--------------------------------
 fs/btrfs/transaction.h |  2 --
 7 files changed, 35 insertions(+), 112 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7e60741..5f99743 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1411,11 +1411,6 @@ struct btrfs_fs_info {
 	 */
 	u64 last_trans_log_full_commit;
 	unsigned long mount_opt;
-	/*
-	 * Track requests for actions that need to be done during transaction
-	 * commit (like for some mount options).
-	 */
-	unsigned long pending_changes;
 	unsigned long compress_type:4;
 	int commit_interval;
 	/*
@@ -2113,6 +2108,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
 #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR	(1 << 22)
 #define BTRFS_MOUNT_RESCAN_UUID_TREE	(1 << 23)
+#define	BTRFS_MOUNT_CHANGE_INODE_CACHE	(1 << 24)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
 #define BTRFS_DEFAULT_MAX_INLINE	(8192)
@@ -2138,49 +2134,6 @@ struct btrfs_ioctl_defrag_range_args {
 }
 
 /*
- * Requests for changes that need to be done during transaction commit.
- *
- * Internal mount options that are used for special handling of the real
- * mount options (eg. cannot be set during remount and have to be set during
- * transaction commit)
- */
-
-#define BTRFS_PENDING_SET_INODE_MAP_CACHE	(0)
-#define BTRFS_PENDING_CLEAR_INODE_MAP_CACHE	(1)
-#define BTRFS_PENDING_COMMIT			(2)
-
-#define btrfs_test_pending(info, opt)	\
-	test_bit(BTRFS_PENDING_##opt, &(info)->pending_changes)
-#define btrfs_set_pending(info, opt)	\
-	set_bit(BTRFS_PENDING_##opt, &(info)->pending_changes)
-#define btrfs_clear_pending(info, opt)	\
-	clear_bit(BTRFS_PENDING_##opt, &(info)->pending_changes)
-
-/*
- * Helpers for setting pending mount option changes.
- *
- * Expects corresponding macros
- * BTRFS_PENDING_SET_ and CLEAR_ + short mount option name
- */
-#define btrfs_set_pending_and_info(info, opt, fmt, args...)            \
-do {                                                                   \
-       if (!btrfs_raw_test_opt((info)->mount_opt, opt)) {              \
-               btrfs_info((info), fmt, ##args);                        \
-               btrfs_set_pending((info), SET_##opt);                   \
-               btrfs_clear_pending((info), CLEAR_##opt);               \
-       }                                                               \
-} while(0)
-
-#define btrfs_clear_pending_and_info(info, opt, fmt, args...)          \
-do {                                                                   \
-       if (btrfs_raw_test_opt((info)->mount_opt, opt)) {               \
-               btrfs_info((info), fmt, ##args);                        \
-               btrfs_set_pending((info), CLEAR_##opt);                 \
-               btrfs_clear_pending((info), SET_##opt);                 \
-       }                                                               \
-} while(0)
-
-/*
  * Inode flags
  */
 #define BTRFS_INODE_NODATASUM		(1 << 0)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8c63419..2d3c8b7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2832,11 +2832,9 @@ retry_root_backup:
 		btrfs_set_opt(fs_info->mount_opt, SSD);
 	}
 
-	/*
-	 * Mount does not set all options immediatelly, we can do it now and do
-	 * not have to wait for transaction commit
-	 */
-	btrfs_apply_pending_changes(fs_info);
+	/* Set the real inode map cache flag */
+	if (btrfs_test_opt(tree_root, CHANGE_INODE_CACHE))
+		btrfs_set_opt(tree_root->fs_info->mount_opt, INODE_MAP_CACHE);
 
 #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
 	if (btrfs_test_opt(tree_root, CHECK_INTEGRITY)) {
diff --git a/fs/btrfs/inode-map.c b/fs/btrfs/inode-map.c
index 74faea3..81efd83 100644
--- a/fs/btrfs/inode-map.c
+++ b/fs/btrfs/inode-map.c
@@ -178,7 +178,7 @@ static void start_caching(struct btrfs_root *root)
 			  root->root_key.objectid);
 	if (IS_ERR(tsk)) {
 		btrfs_warn(root->fs_info, "failed to start inode caching task");
-		btrfs_clear_pending_and_info(root->fs_info, INODE_MAP_CACHE,
+		btrfs_clear_and_info(root, CHANGE_INODE_CACHE,
 				"disabling inode map caching");
 	}
 }
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 60f7cbe..b0c45b2 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -993,17 +993,9 @@ int btrfs_sync_fs(struct super_block *sb, int wait)
 	trans = btrfs_attach_transaction_barrier(root);
 	if (IS_ERR(trans)) {
 		/* no transaction, don't bother */
-		if (PTR_ERR(trans) == -ENOENT) {
-			/*
-			 * Exit unless we have some pending changes
-			 * that need to go through commit
-			 */
-			if (fs_info->pending_changes == 0)
-				return 0;
-			trans = btrfs_start_transaction(root, 0);
-		} else {
-			return PTR_ERR(trans);
-		}
+		if (PTR_ERR(trans) == -ENOENT)
+			return 0;
+		return PTR_ERR(trans);
 	}
 	return btrfs_commit_transaction(trans, root);
 }
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 92db3f6..b2e7bb4 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -111,6 +111,7 @@ static ssize_t btrfs_feature_attr_store(struct kobject *kobj,
 {
 	struct btrfs_fs_info *fs_info;
 	struct btrfs_feature_attr *fa = to_btrfs_feature_attr(a);
+	struct btrfs_trans_handle *trans;
 	u64 features, set, clear;
 	unsigned long val;
 	int ret;
@@ -152,6 +153,10 @@ static ssize_t btrfs_feature_attr_store(struct kobject *kobj,
 	btrfs_info(fs_info, "%s %s feature flag",
 		   val ? "Setting" : "Clearing", fa->kobj_attr.attr.name);
 
+	trans = btrfs_start_transaction(fs_info->fs_root, 0);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+
 	spin_lock(&fs_info->super_lock);
 	features = get_features(fs_info, fa->feature_set);
 	if (val)
@@ -161,11 +166,9 @@ static ssize_t btrfs_feature_attr_store(struct kobject *kobj,
 	set_features(fs_info, fa->feature_set, features);
 	spin_unlock(&fs_info->super_lock);
 
-	/*
-	 * We don't want to do full transaction commit from inside sysfs
-	 */
-	btrfs_set_pending(fs_info, COMMIT);
-	wake_up_process(fs_info->transaction_kthread);
+	ret = btrfs_commit_transaction(trans, fs_info->fs_root);
+	if (ret)
+		return ret;
 
 	return count;
 }
@@ -369,6 +372,9 @@ static ssize_t btrfs_label_store(struct kobject *kobj,
 				 const char *buf, size_t len)
 {
 	struct btrfs_fs_info *fs_info = to_fs_info(kobj);
+	struct btrfs_trans_handle *trans;
+	struct btrfs_root *root = fs_info->fs_root;
+	int ret;
 	size_t p_len;
 
 	if (fs_info->sb->s_flags & MS_RDONLY)
@@ -383,18 +389,20 @@ static ssize_t btrfs_label_store(struct kobject *kobj,
 	if (p_len >= BTRFS_LABEL_SIZE)
 		return -EINVAL;
 
-	spin_lock(&fs_info->super_lock);
+	trans = btrfs_start_transaction(root, 0);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+
+	spin_lock(&root->fs_info->super_lock);
 	memset(fs_info->super_copy->label, 0, BTRFS_LABEL_SIZE);
 	memcpy(fs_info->super_copy->label, buf, p_len);
-	spin_unlock(&fs_info->super_lock);
+	spin_unlock(&root->fs_info->super_lock);
+	ret = btrfs_commit_transaction(trans, root);
 
-	/*
-	 * We don't want to do full transaction commit from inside sysfs
-	 */
-	btrfs_set_pending(fs_info, COMMIT);
-	wake_up_process(fs_info->transaction_kthread);
+	if (!ret)
+		return len;
 
-	return len;
+	return ret;
 }
 BTRFS_ATTR_RW(label, btrfs_label_show, btrfs_label_store);
 
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index a605d4e..295a135 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1938,10 +1938,13 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 	}
 
 	/*
-	 * Since the transaction is done, we can apply the pending changes
-	 * before the next transaction.
+	 * Since the transaction is done, we should set the inode map cache flag
+	 * before any other comming transaction.
 	 */
-	btrfs_apply_pending_changes(root->fs_info);
+	if (btrfs_test_opt(root, CHANGE_INODE_CACHE))
+		btrfs_set_opt(root->fs_info->mount_opt, INODE_MAP_CACHE);
+	else
+		btrfs_clear_opt(root->fs_info->mount_opt, INODE_MAP_CACHE);
 
 	/* commit_fs_roots gets rid of all the tree log roots, it is now
 	 * safe to free the root of tree log roots
@@ -2112,32 +2115,3 @@ int btrfs_clean_one_deleted_snapshot(struct btrfs_root *root)
 
 	return (ret < 0) ? 0 : 1;
 }
-
-void btrfs_apply_pending_changes(struct btrfs_fs_info *fs_info)
-{
-	unsigned long prev;
-	unsigned long bit;
-
-	prev = cmpxchg(&fs_info->pending_changes, 0, 0);
-	if (!prev)
-		return;
-
-	bit = 1 << BTRFS_PENDING_SET_INODE_MAP_CACHE;
-	if (prev & bit)
-		btrfs_set_opt(fs_info->mount_opt, INODE_MAP_CACHE);
-	prev &= ~bit;
-
-	bit = 1 << BTRFS_PENDING_CLEAR_INODE_MAP_CACHE;
-	if (prev & bit)
-		btrfs_clear_opt(fs_info->mount_opt, INODE_MAP_CACHE);
-	prev &= ~bit;
-
-	bit = 1 << BTRFS_PENDING_COMMIT;
-	if (prev & bit)
-		btrfs_debug(fs_info, "pending commit done");
-	prev &= ~bit;
-
-	if (prev)
-		btrfs_warn(fs_info,
-			"unknown pending changes left 0x%lx, ignoring", prev);
-}
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 00ed29c..fd400a3 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -170,6 +170,4 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
 int btrfs_transaction_blocked(struct btrfs_fs_info *info);
 int btrfs_transaction_in_commit(struct btrfs_fs_info *info);
 void btrfs_put_transaction(struct btrfs_transaction *transaction);
-void btrfs_apply_pending_changes(struct btrfs_fs_info *fs_info);
-
 #endif
-- 
2.2.2


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount option in a atomic way
  2015-01-29  2:24 [PATCH RESEND v4 0/8] Fix freeze/sysfs deadlock in better method Qu Wenruo
  2015-01-29  2:24 ` [PATCH RESEND v4 1/8] Revert "btrfs: add support for processing pending changes" related commits Qu Wenruo
@ 2015-01-29  2:24 ` Qu Wenruo
  2015-01-30  0:31   ` Miao Xie
  2015-01-29  2:24 ` [PATCH RESEND v4 3/8] btrfs: Introduce per-transaction mount_opt to keep mount option consistent during transaction Qu Wenruo
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 40+ messages in thread
From: Qu Wenruo @ 2015-01-29  2:24 UTC (permalink / raw)
  To: linux-btrfs

Current btrfs_parse_options() is not atomic, which can set and clear a
bit, especially for nospace_cache case.

For example, if a fs is mounted with nospace_cache,
btrfs_parse_options() will set SPACE_CACHE bit first(since
cache_generation is non-zeo) and clear the SPACE_CACHE bit due to
nospace_cache mount option.
So under heavy operations and remount a nospace_cache btrfs, there is a
windows for commit to create space cache.

This bug can be reproduced by fstest/btrfs/071 073 074 with
nospace_cache mount option. It has about 50% chance to create space
cache, and about 10% chance to create wrong space cache, which can't
pass btrfsck.

This patch will do the mount option parse in a copy-and-update method.
First copy the mount_opt from fs_info to new_opt, and only update
options in new_opt. At last, copy the new_opt back to
fs_info->mount_opt.

This patch is already good enough to fix the above nospace_cache +
remount bug, but need later patch to make sure mount options does not
change during transaction.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/ctree.h |  16 ++++----
 fs/btrfs/super.c | 115 +++++++++++++++++++++++++++++--------------------------
 2 files changed, 69 insertions(+), 62 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5f99743..26bb47b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2119,18 +2119,18 @@ struct btrfs_ioctl_defrag_range_args {
 #define btrfs_test_opt(root, opt)	((root)->fs_info->mount_opt & \
 					 BTRFS_MOUNT_##opt)
 
-#define btrfs_set_and_info(root, opt, fmt, args...)			\
+#define btrfs_set_and_info(fs_info, val, opt, fmt, args...)		\
 {									\
-	if (!btrfs_test_opt(root, opt))					\
-		btrfs_info(root->fs_info, fmt, ##args);			\
-	btrfs_set_opt(root->fs_info->mount_opt, opt);			\
+	if (!btrfs_raw_test_opt(val, opt))				\
+		btrfs_info(fs_info, fmt, ##args);			\
+	btrfs_set_opt(val, opt);					\
 }
 
-#define btrfs_clear_and_info(root, opt, fmt, args...)			\
+#define btrfs_clear_and_info(fs_info, val, opt, fmt, args...)		\
 {									\
-	if (btrfs_test_opt(root, opt))					\
-		btrfs_info(root->fs_info, fmt, ##args);			\
-	btrfs_clear_opt(root->fs_info->mount_opt, opt);			\
+	if (btrfs_raw_test_opt(val, opt))				\
+		btrfs_info(fs_info, fmt, ##args);			\
+	btrfs_clear_opt(val, opt);					\
 }
 
 /*
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index b0c45b2..490fe1f 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -395,10 +395,13 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 	int ret = 0;
 	char *compress_type;
 	bool compress_force = false;
+	unsigned long new_opt;
+
+	new_opt = info->mount_opt;
 
 	cache_gen = btrfs_super_cache_generation(root->fs_info->super_copy);
 	if (cache_gen)
-		btrfs_set_opt(info->mount_opt, SPACE_CACHE);
+		btrfs_set_opt(new_opt, SPACE_CACHE);
 
 	if (!options)
 		goto out;
@@ -422,7 +425,7 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 		switch (token) {
 		case Opt_degraded:
 			btrfs_info(root->fs_info, "allowing degraded mounts");
-			btrfs_set_opt(info->mount_opt, DEGRADED);
+			btrfs_set_opt(new_opt, DEGRADED);
 			break;
 		case Opt_subvol:
 		case Opt_subvolid:
@@ -434,7 +437,7 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 			 */
 			break;
 		case Opt_nodatasum:
-			btrfs_set_and_info(root, NODATASUM,
+			btrfs_set_and_info(info, new_opt, NODATASUM,
 					   "setting nodatasum");
 			break;
 		case Opt_datasum:
@@ -444,8 +447,8 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 				else
 					btrfs_info(root->fs_info, "setting datasum");
 			}
-			btrfs_clear_opt(info->mount_opt, NODATACOW);
-			btrfs_clear_opt(info->mount_opt, NODATASUM);
+			btrfs_clear_opt(new_opt, NODATACOW);
+			btrfs_clear_opt(new_opt, NODATASUM);
 			break;
 		case Opt_nodatacow:
 			if (!btrfs_test_opt(root, NODATACOW)) {
@@ -457,13 +460,13 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 					btrfs_info(root->fs_info, "setting nodatacow");
 				}
 			}
-			btrfs_clear_opt(info->mount_opt, COMPRESS);
-			btrfs_clear_opt(info->mount_opt, FORCE_COMPRESS);
-			btrfs_set_opt(info->mount_opt, NODATACOW);
-			btrfs_set_opt(info->mount_opt, NODATASUM);
+			btrfs_clear_opt(new_opt, COMPRESS);
+			btrfs_clear_opt(new_opt, FORCE_COMPRESS);
+			btrfs_set_opt(new_opt, NODATACOW);
+			btrfs_set_opt(new_opt, NODATASUM);
 			break;
 		case Opt_datacow:
-			btrfs_clear_and_info(root, NODATACOW,
+			btrfs_clear_and_info(info, new_opt, NODATACOW,
 					     "setting datacow");
 			break;
 		case Opt_compress_force:
@@ -477,20 +480,20 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 			    strcmp(args[0].from, "zlib") == 0) {
 				compress_type = "zlib";
 				info->compress_type = BTRFS_COMPRESS_ZLIB;
-				btrfs_set_opt(info->mount_opt, COMPRESS);
-				btrfs_clear_opt(info->mount_opt, NODATACOW);
-				btrfs_clear_opt(info->mount_opt, NODATASUM);
+				btrfs_set_opt(new_opt, COMPRESS);
+				btrfs_clear_opt(new_opt, NODATACOW);
+				btrfs_clear_opt(new_opt, NODATASUM);
 			} else if (strcmp(args[0].from, "lzo") == 0) {
 				compress_type = "lzo";
 				info->compress_type = BTRFS_COMPRESS_LZO;
-				btrfs_set_opt(info->mount_opt, COMPRESS);
-				btrfs_clear_opt(info->mount_opt, NODATACOW);
-				btrfs_clear_opt(info->mount_opt, NODATASUM);
+				btrfs_set_opt(new_opt, COMPRESS);
+				btrfs_clear_opt(new_opt, NODATACOW);
+				btrfs_clear_opt(new_opt, NODATASUM);
 				btrfs_set_fs_incompat(info, COMPRESS_LZO);
 			} else if (strncmp(args[0].from, "no", 2) == 0) {
 				compress_type = "no";
-				btrfs_clear_opt(info->mount_opt, COMPRESS);
-				btrfs_clear_opt(info->mount_opt, FORCE_COMPRESS);
+				btrfs_clear_opt(new_opt, COMPRESS);
+				btrfs_clear_opt(new_opt, FORCE_COMPRESS);
 				compress_force = false;
 			} else {
 				ret = -EINVAL;
@@ -498,7 +501,8 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 			}
 
 			if (compress_force) {
-				btrfs_set_and_info(root, FORCE_COMPRESS,
+				btrfs_set_and_info(info, new_opt,
+						   FORCE_COMPRESS,
 						   "force %s compression",
 						   compress_type);
 			} else {
@@ -512,29 +516,29 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 				 * flag, otherwise, there is no way for users
 				 * to disable forcible compression separately.
 				 */
-				btrfs_clear_opt(info->mount_opt, FORCE_COMPRESS);
+				btrfs_clear_opt(new_opt, FORCE_COMPRESS);
 			}
 			break;
 		case Opt_ssd:
-			btrfs_set_and_info(root, SSD,
+			btrfs_set_and_info(info, new_opt, SSD,
 					   "use ssd allocation scheme");
 			break;
 		case Opt_ssd_spread:
-			btrfs_set_and_info(root, SSD_SPREAD,
+			btrfs_set_and_info(info, new_opt, SSD_SPREAD,
 					   "use spread ssd allocation scheme");
-			btrfs_set_opt(info->mount_opt, SSD);
+			btrfs_set_opt(new_opt, SSD);
 			break;
 		case Opt_nossd:
-			btrfs_set_and_info(root, NOSSD,
-					     "not using ssd allocation scheme");
-			btrfs_clear_opt(info->mount_opt, SSD);
+			btrfs_set_and_info(info, new_opt, NOSSD,
+					   "not using ssd allocation scheme");
+			btrfs_clear_opt(new_opt, SSD);
 			break;
 		case Opt_barrier:
-			btrfs_clear_and_info(root, NOBARRIER,
+			btrfs_clear_and_info(info, new_opt, NOBARRIER,
 					     "turning on barriers");
 			break;
 		case Opt_nobarrier:
-			btrfs_set_and_info(root, NOBARRIER,
+			btrfs_set_and_info(info, new_opt, NOBARRIER,
 					   "turning off barriers");
 			break;
 		case Opt_thread_pool:
@@ -594,19 +598,19 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 			root->fs_info->sb->s_flags &= ~MS_POSIXACL;
 			break;
 		case Opt_notreelog:
-			btrfs_set_and_info(root, NOTREELOG,
+			btrfs_set_and_info(info, new_opt, NOTREELOG,
 					   "disabling tree log");
 			break;
 		case Opt_treelog:
-			btrfs_clear_and_info(root, NOTREELOG,
+			btrfs_clear_and_info(info, new_opt, NOTREELOG,
 					     "enabling tree log");
 			break;
 		case Opt_flushoncommit:
-			btrfs_set_and_info(root, FLUSHONCOMMIT,
+			btrfs_set_and_info(info, new_opt, FLUSHONCOMMIT,
 					   "turning on flush-on-commit");
 			break;
 		case Opt_noflushoncommit:
-			btrfs_clear_and_info(root, FLUSHONCOMMIT,
+			btrfs_clear_and_info(info, new_opt, FLUSHONCOMMIT,
 					     "turning off flush-on-commit");
 			break;
 		case Opt_ratio:
@@ -623,71 +627,71 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 			}
 			break;
 		case Opt_discard:
-			btrfs_set_and_info(root, DISCARD,
+			btrfs_set_and_info(info, new_opt, DISCARD,
 					   "turning on discard");
 			break;
 		case Opt_nodiscard:
-			btrfs_clear_and_info(root, DISCARD,
+			btrfs_clear_and_info(info, new_opt, DISCARD,
 					     "turning off discard");
 			break;
 		case Opt_space_cache:
-			btrfs_set_and_info(root, SPACE_CACHE,
+			btrfs_set_and_info(info, new_opt, SPACE_CACHE,
 					   "enabling disk space caching");
 			break;
 		case Opt_rescan_uuid_tree:
-			btrfs_set_opt(info->mount_opt, RESCAN_UUID_TREE);
+			btrfs_set_opt(new_opt, RESCAN_UUID_TREE);
 			break;
 		case Opt_no_space_cache:
-			btrfs_clear_and_info(root, SPACE_CACHE,
+			btrfs_clear_and_info(info, new_opt, SPACE_CACHE,
 					     "disabling disk space caching");
 			break;
 		case Opt_inode_cache:
-			btrfs_set_pending_and_info(info, INODE_MAP_CACHE,
+			btrfs_set_and_info(info, new_opt, INODE_MAP_CACHE,
 					   "enabling inode map caching");
 			break;
 		case Opt_noinode_cache:
-			btrfs_clear_pending_and_info(info, INODE_MAP_CACHE,
+			btrfs_clear_and_info(info, new_opt, INODE_MAP_CACHE,
 					     "disabling inode map caching");
 			break;
 		case Opt_clear_cache:
-			btrfs_set_and_info(root, CLEAR_CACHE,
+			btrfs_set_and_info(info, new_opt, CLEAR_CACHE,
 					   "force clearing of disk cache");
 			break;
 		case Opt_user_subvol_rm_allowed:
-			btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
+			btrfs_set_opt(new_opt, USER_SUBVOL_RM_ALLOWED);
 			break;
 		case Opt_enospc_debug:
-			btrfs_set_opt(info->mount_opt, ENOSPC_DEBUG);
+			btrfs_set_opt(new_opt, ENOSPC_DEBUG);
 			break;
 		case Opt_noenospc_debug:
-			btrfs_clear_opt(info->mount_opt, ENOSPC_DEBUG);
+			btrfs_clear_opt(new_opt, ENOSPC_DEBUG);
 			break;
 		case Opt_defrag:
-			btrfs_set_and_info(root, AUTO_DEFRAG,
+			btrfs_set_and_info(info, new_opt, AUTO_DEFRAG,
 					   "enabling auto defrag");
 			break;
 		case Opt_nodefrag:
-			btrfs_clear_and_info(root, AUTO_DEFRAG,
+			btrfs_clear_and_info(info, new_opt, AUTO_DEFRAG,
 					     "disabling auto defrag");
 			break;
 		case Opt_recovery:
 			btrfs_info(root->fs_info, "enabling auto recovery");
-			btrfs_set_opt(info->mount_opt, RECOVERY);
+			btrfs_set_opt(new_opt, RECOVERY);
 			break;
 		case Opt_skip_balance:
-			btrfs_set_opt(info->mount_opt, SKIP_BALANCE);
+			btrfs_set_opt(new_opt, SKIP_BALANCE);
 			break;
 #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
 		case Opt_check_integrity_including_extent_data:
 			btrfs_info(root->fs_info,
 				   "enabling check integrity including extent data");
-			btrfs_set_opt(info->mount_opt,
+			btrfs_set_opt(new_opt,
 				      CHECK_INTEGRITY_INCLUDING_EXTENT_DATA);
-			btrfs_set_opt(info->mount_opt, CHECK_INTEGRITY);
+			btrfs_set_opt(new_opt, CHECK_INTEGRITY);
 			break;
 		case Opt_check_integrity:
 			btrfs_info(root->fs_info, "enabling check integrity");
-			btrfs_set_opt(info->mount_opt, CHECK_INTEGRITY);
+			btrfs_set_opt(new_opt, CHECK_INTEGRITY);
 			break;
 		case Opt_check_integrity_print_mask:
 			ret = match_int(&args[0], &intarg);
@@ -713,10 +717,10 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 #endif
 		case Opt_fatal_errors:
 			if (strcmp(args[0].from, "panic") == 0)
-				btrfs_set_opt(info->mount_opt,
+				btrfs_set_opt(new_opt,
 					      PANIC_ON_FATAL_ERROR);
 			else if (strcmp(args[0].from, "bug") == 0)
-				btrfs_clear_opt(info->mount_opt,
+				btrfs_clear_opt(new_opt,
 					      PANIC_ON_FATAL_ERROR);
 			else {
 				ret = -EINVAL;
@@ -752,8 +756,11 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 		}
 	}
 out:
-	if (!ret && btrfs_test_opt(root, SPACE_CACHE))
-		btrfs_info(root->fs_info, "disk space caching is enabled");
+	if (!ret) {
+		if (btrfs_raw_test_opt(new_opt, SPACE_CACHE))
+			btrfs_info(info, "disk space caching is enabled");
+		info->mount_opt = new_opt;
+	}
 	kfree(orig);
 	return ret;
 }
-- 
2.2.2


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RESEND v4 3/8] btrfs: Introduce per-transaction mount_opt to keep mount option consistent during transaction.
  2015-01-29  2:24 [PATCH RESEND v4 0/8] Fix freeze/sysfs deadlock in better method Qu Wenruo
  2015-01-29  2:24 ` [PATCH RESEND v4 1/8] Revert "btrfs: add support for processing pending changes" related commits Qu Wenruo
  2015-01-29  2:24 ` [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount option in a atomic way Qu Wenruo
@ 2015-01-29  2:24 ` Qu Wenruo
  2015-01-29  2:24 ` [PATCH RESEND v4 4/8] btrfs: Use btrfs_test_trans_opt() to handle SPACE_CACHE if it's under transaction protect Qu Wenruo
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2015-01-29  2:24 UTC (permalink / raw)
  To: linux-btrfs

Before this patch, mount_opt is not consistent during a transaction.
btrfs_parse_options() can race with transaction.

Now each transaction will keep a copy of fs_info->mount_opt upon
creation, and new btrfs_test_trans_opt() macro is introduced to get the
mount_opt in the transaction.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/transaction.c | 1 +
 fs/btrfs/transaction.h | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 295a135..846e1b8 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -210,6 +210,7 @@ loop:
 		return -EROFS;
 	}
 
+	cur_trans->mount_opt = fs_info->mount_opt;
 	atomic_set(&cur_trans->num_writers, 1);
 	extwriter_counter_init(cur_trans, type);
 	init_waitqueue_head(&cur_trans->writer_wait);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index fd400a3..4052879 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -52,6 +52,7 @@ struct btrfs_transaction {
 	struct list_head list;
 	struct extent_io_tree dirty_pages;
 	unsigned long start_time;
+	unsigned long mount_opt;
 	wait_queue_head_t writer_wait;
 	wait_queue_head_t commit_wait;
 	struct list_head pending_snapshots;
@@ -126,6 +127,9 @@ struct btrfs_pending_snapshot {
 	struct list_head list;
 };
 
+#define btrfs_test_trans_opt(trans, opt)	\
+	(btrfs_raw_test_opt(trans->transaction->mount_opt, opt))
+
 static inline void btrfs_set_inode_last_trans(struct btrfs_trans_handle *trans,
 					      struct inode *inode)
 {
-- 
2.2.2


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RESEND v4 4/8] btrfs: Use btrfs_test_trans_opt() to handle SPACE_CACHE if it's under transaction protect.
  2015-01-29  2:24 [PATCH RESEND v4 0/8] Fix freeze/sysfs deadlock in better method Qu Wenruo
                   ` (2 preceding siblings ...)
  2015-01-29  2:24 ` [PATCH RESEND v4 3/8] btrfs: Introduce per-transaction mount_opt to keep mount option consistent during transaction Qu Wenruo
@ 2015-01-29  2:24 ` Qu Wenruo
  2015-01-29  2:24 ` [PATCH RESEND v4 5/8] btrfs: Use btrfs_test_trans_opt() to handle INODE_CACHE " Qu Wenruo
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2015-01-29  2:24 UTC (permalink / raw)
  To: linux-btrfs

Convert btrfs_test_opt() to btrfs_test_trans_opt() if it's called under
transaction protection.
This will ensure SPACE_CACHE bit is consistent during transaction.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 2 +-
 fs/btrfs/transaction.c | 7 ++++---
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1511658..c968d12 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3260,7 +3260,7 @@ again:
 
 	spin_lock(&block_group->lock);
 	if (block_group->cached != BTRFS_CACHE_FINISHED ||
-	    !btrfs_test_opt(root, SPACE_CACHE) ||
+	    !btrfs_test_trans_opt(trans, SPACE_CACHE) ||
 	    block_group->delalloc_bytes) {
 		/*
 		 * don't bother trying to write stuff out _if_
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 846e1b8..aec5a5a 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1507,7 +1507,8 @@ static noinline int create_pending_snapshots(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
-static void update_super_roots(struct btrfs_root *root)
+static void update_super_roots(struct btrfs_trans_handle *trans,
+			       struct btrfs_root *root)
 {
 	struct btrfs_root_item *root_item;
 	struct btrfs_super_block *super;
@@ -1523,7 +1524,7 @@ static void update_super_roots(struct btrfs_root *root)
 	super->root = root_item->bytenr;
 	super->generation = root_item->generation;
 	super->root_level = root_item->level;
-	if (btrfs_test_opt(root, SPACE_CACHE))
+	if (btrfs_test_trans_opt(trans, SPACE_CACHE))
 		super->cache_generation = root_item->generation;
 	if (root->fs_info->update_uuid_tree_gen)
 		super->uuid_tree_generation = root_item->generation;
@@ -1987,7 +1988,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 	switch_commit_roots(cur_trans, root->fs_info);
 
 	assert_qgroups_uptodate(trans);
-	update_super_roots(root);
+	update_super_roots(trans, root);
 
 	btrfs_set_super_log_root(root->fs_info->super_copy, 0);
 	btrfs_set_super_log_root_level(root->fs_info->super_copy, 0);
-- 
2.2.2


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RESEND v4 5/8] btrfs: Use btrfs_test_trans_opt() to handle INODE_CACHE if it's under transaction protect.
  2015-01-29  2:24 [PATCH RESEND v4 0/8] Fix freeze/sysfs deadlock in better method Qu Wenruo
                   ` (3 preceding siblings ...)
  2015-01-29  2:24 ` [PATCH RESEND v4 4/8] btrfs: Use btrfs_test_trans_opt() to handle SPACE_CACHE if it's under transaction protect Qu Wenruo
@ 2015-01-29  2:24 ` Qu Wenruo
  2015-01-29  2:24 ` [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb Qu Wenruo
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2015-01-29  2:24 UTC (permalink / raw)
  To: linux-btrfs

Convert btrfs_test_opt() to btrfs_test_trans_opt() if it's called under
transaction protection.

This will ensure SPACE_CACHE bit is consistent during transaction.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/ctree.h            | 1 -
 fs/btrfs/disk-io.c          | 4 ----
 fs/btrfs/free-space-cache.c | 2 +-
 fs/btrfs/inode-map.c        | 5 +++--
 fs/btrfs/transaction.c      | 9 ---------
 5 files changed, 4 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 26bb47b..fa04db9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2108,7 +2108,6 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
 #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR	(1 << 22)
 #define BTRFS_MOUNT_RESCAN_UUID_TREE	(1 << 23)
-#define	BTRFS_MOUNT_CHANGE_INODE_CACHE	(1 << 24)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
 #define BTRFS_DEFAULT_MAX_INLINE	(8192)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2d3c8b7..f4d168d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2832,10 +2832,6 @@ retry_root_backup:
 		btrfs_set_opt(fs_info->mount_opt, SSD);
 	}
 
-	/* Set the real inode map cache flag */
-	if (btrfs_test_opt(tree_root, CHANGE_INODE_CACHE))
-		btrfs_set_opt(tree_root->fs_info->mount_opt, INODE_MAP_CACHE);
-
 #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
 	if (btrfs_test_opt(tree_root, CHECK_INTEGRITY)) {
 		ret = btrfsic_mount(tree_root, fs_devices,
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index d6c03f7..2b9cabf 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3348,7 +3348,7 @@ int btrfs_write_out_ino_cache(struct btrfs_root *root,
 	struct btrfs_free_space_ctl *ctl = root->free_ino_ctl;
 	int ret;
 
-	if (!btrfs_test_opt(root, INODE_MAP_CACHE))
+	if (!btrfs_test_trans_opt(trans, INODE_MAP_CACHE))
 		return 0;
 
 	ret = __btrfs_write_out_cache(root, inode, ctl, NULL, trans, path, 0);
diff --git a/fs/btrfs/inode-map.c b/fs/btrfs/inode-map.c
index 81efd83..49b089c 100644
--- a/fs/btrfs/inode-map.c
+++ b/fs/btrfs/inode-map.c
@@ -178,7 +178,8 @@ static void start_caching(struct btrfs_root *root)
 			  root->root_key.objectid);
 	if (IS_ERR(tsk)) {
 		btrfs_warn(root->fs_info, "failed to start inode caching task");
-		btrfs_clear_and_info(root, CHANGE_INODE_CACHE,
+		btrfs_clear_and_info(root->fs_info, root->fs_info->mount_opt,
+				INODE_MAP_CACHE,
 				"disabling inode map caching");
 	}
 }
@@ -405,7 +406,7 @@ int btrfs_save_ino_cache(struct btrfs_root *root,
 	if (btrfs_root_refs(&root->root_item) == 0)
 		return 0;
 
-	if (!btrfs_test_opt(root, INODE_MAP_CACHE))
+	if (!btrfs_test_trans_opt(trans, INODE_MAP_CACHE))
 		return 0;
 
 	path = btrfs_alloc_path();
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index aec5a5a..dd9e7af 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1939,15 +1939,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 		goto scrub_continue;
 	}
 
-	/*
-	 * Since the transaction is done, we should set the inode map cache flag
-	 * before any other comming transaction.
-	 */
-	if (btrfs_test_opt(root, CHANGE_INODE_CACHE))
-		btrfs_set_opt(root->fs_info->mount_opt, INODE_MAP_CACHE);
-	else
-		btrfs_clear_opt(root->fs_info->mount_opt, INODE_MAP_CACHE);
-
 	/* commit_fs_roots gets rid of all the tree log roots, it is now
 	 * safe to free the root of tree log roots
 	 */
-- 
2.2.2


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
  2015-01-29  2:24 [PATCH RESEND v4 0/8] Fix freeze/sysfs deadlock in better method Qu Wenruo
                   ` (4 preceding siblings ...)
  2015-01-29  2:24 ` [PATCH RESEND v4 5/8] btrfs: Use btrfs_test_trans_opt() to handle INODE_CACHE " Qu Wenruo
@ 2015-01-29  2:24 ` Qu Wenruo
  2015-01-29 12:37   ` David Sterba
  2015-01-30  0:52   ` Miao Xie
  2015-01-29  2:24 ` [PATCH RESEND v4 7/8] btrfs: Use mnt_want_write() to protect label change Qu Wenruo
  2015-01-29  2:24 ` [PATCH RESEND v4 8/8] btrfs: Use mnt_want_write() to protect sysfs feature change Qu Wenruo
  7 siblings, 2 replies; 40+ messages in thread
From: Qu Wenruo @ 2015-01-29  2:24 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel

There are sysfs interfaces in some fs, only btrfs yet, which will modify
on-disk data.
Unlike normal file operation routine we can use mnt_want_write_file() to
protect the operation, change through sysfs won't to be binded to any file
in the filesystem.
So we can only extract the first vfsmount of a superblock and pass it to
mnt_want_write() to do the protection.

Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/namespace.c        | 25 +++++++++++++++++++++++++
 include/linux/mount.h |  1 +
 2 files changed, 26 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index cd1e968..5a16a62 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1105,6 +1105,31 @@ struct vfsmount *mntget(struct vfsmount *mnt)
 }
 EXPORT_SYMBOL(mntget);
 
+/*
+ * get a vfsmount from a given sb
+ *
+ * This is especially used for case where change fs' sysfs interface
+ * will lead to a write, e.g. Change label through sysfs in btrfs.
+ * So vfs can get a vfsmount and then use mnt_want_write() to protect.
+ */
+struct vfsmount *get_vfsmount_sb(struct super_block *sb)
+{
+	struct vfsmount *ret_vfs = NULL;
+	struct mount *mnt;
+	int ret = 0;
+
+	lock_mount_hash();
+	if (list_empty(&sb->s_mounts))
+		goto out;
+	mnt = list_entry(sb->s_mounts.next, struct mount, mnt_instance);
+	ret_vfs = &mnt->mnt;
+	ret_vfs = mntget(ret_vfs);
+out:
+	unlock_mount_hash();
+	return ret_vfs;
+}
+EXPORT_SYMBOL(get_vfsmount_sb);
+
 struct vfsmount *mnt_clone_internal(struct path *path)
 {
 	struct mount *p;
diff --git a/include/linux/mount.h b/include/linux/mount.h
index c2c561d..cf1b0f5 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -79,6 +79,7 @@ extern void mnt_drop_write_file(struct file *file);
 extern void mntput(struct vfsmount *mnt);
 extern struct vfsmount *mntget(struct vfsmount *mnt);
 extern struct vfsmount *mnt_clone_internal(struct path *path);
+extern struct vfsmount *get_vfsmount_sb(struct super_block *sb);
 extern int __mnt_is_readonly(struct vfsmount *mnt);
 
 struct path;
-- 
2.2.2


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RESEND v4 7/8] btrfs: Use mnt_want_write() to protect label change.
  2015-01-29  2:24 [PATCH RESEND v4 0/8] Fix freeze/sysfs deadlock in better method Qu Wenruo
                   ` (5 preceding siblings ...)
  2015-01-29  2:24 ` [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb Qu Wenruo
@ 2015-01-29  2:24 ` Qu Wenruo
  2015-01-29  2:24 ` [PATCH RESEND v4 8/8] btrfs: Use mnt_want_write() to protect sysfs feature change Qu Wenruo
  7 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2015-01-29  2:24 UTC (permalink / raw)
  To: linux-btrfs

Use the new vfs API get_vfsmount_sb() to get a vfsmount and use it to
do the write protection of the label change transaction.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/sysfs.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index b2e7bb4..cdfa6b9 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -25,6 +25,7 @@
 #include <linux/bug.h>
 #include <linux/genhd.h>
 #include <linux/debugfs.h>
+#include <linux/mount.h>
 
 #include "ctree.h"
 #include "disk-io.h"
@@ -374,11 +375,10 @@ static ssize_t btrfs_label_store(struct kobject *kobj,
 	struct btrfs_fs_info *fs_info = to_fs_info(kobj);
 	struct btrfs_trans_handle *trans;
 	struct btrfs_root *root = fs_info->fs_root;
+	struct vfsmount *vfsmount;
 	int ret;
 	size_t p_len;
 
-	if (fs_info->sb->s_flags & MS_RDONLY)
-		return -EROFS;
 
 	/*
 	 * p_len is the len until the first occurrence of either
@@ -389,9 +389,21 @@ static ssize_t btrfs_label_store(struct kobject *kobj,
 	if (p_len >= BTRFS_LABEL_SIZE)
 		return -EINVAL;
 
+	vfsmount = get_vfsmount_sb(fs_info->sb);
+	if (!vfsmount)
+		return -EINVAL;
+
+	ret = mnt_want_write(vfsmount);
+	if (ret) {
+		mntput(vfsmount);
+		return ret;
+	}
+
 	trans = btrfs_start_transaction(root, 0);
-	if (IS_ERR(trans))
-		return PTR_ERR(trans);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		goto out;
+	}
 
 	spin_lock(&root->fs_info->super_lock);
 	memset(fs_info->super_copy->label, 0, BTRFS_LABEL_SIZE);
@@ -399,9 +411,11 @@ static ssize_t btrfs_label_store(struct kobject *kobj,
 	spin_unlock(&root->fs_info->super_lock);
 	ret = btrfs_commit_transaction(trans, root);
 
+out:
+	mnt_drop_write(vfsmount);
+	mntput(vfsmount);
 	if (!ret)
 		return len;
-
 	return ret;
 }
 BTRFS_ATTR_RW(label, btrfs_label_show, btrfs_label_store);
-- 
2.2.2


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RESEND v4 8/8] btrfs: Use mnt_want_write() to protect sysfs feature change.
  2015-01-29  2:24 [PATCH RESEND v4 0/8] Fix freeze/sysfs deadlock in better method Qu Wenruo
                   ` (6 preceding siblings ...)
  2015-01-29  2:24 ` [PATCH RESEND v4 7/8] btrfs: Use mnt_want_write() to protect label change Qu Wenruo
@ 2015-01-29  2:24 ` Qu Wenruo
  7 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2015-01-29  2:24 UTC (permalink / raw)
  To: linux-btrfs

Just like label change, use mnt_want_write() to do a correct protection.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/sysfs.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index cdfa6b9..71e50ba 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -113,6 +113,7 @@ static ssize_t btrfs_feature_attr_store(struct kobject *kobj,
 	struct btrfs_fs_info *fs_info;
 	struct btrfs_feature_attr *fa = to_btrfs_feature_attr(a);
 	struct btrfs_trans_handle *trans;
+	struct vfsmount *vfsmount;
 	u64 features, set, clear;
 	unsigned long val;
 	int ret;
@@ -154,9 +155,20 @@ static ssize_t btrfs_feature_attr_store(struct kobject *kobj,
 	btrfs_info(fs_info, "%s %s feature flag",
 		   val ? "Setting" : "Clearing", fa->kobj_attr.attr.name);
 
+	vfsmount = get_vfsmount_sb(fs_info->sb);
+	if (!vfsmount)
+		return -EINVAL;
+
+	ret = mnt_want_write(vfsmount);
+	if (ret) {
+		mntput(vfsmount);
+		return ret;
+	}
 	trans = btrfs_start_transaction(fs_info->fs_root, 0);
-	if (IS_ERR(trans))
-		return PTR_ERR(trans);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		goto out;
+	}
 
 	spin_lock(&fs_info->super_lock);
 	features = get_features(fs_info, fa->feature_set);
@@ -168,10 +180,13 @@ static ssize_t btrfs_feature_attr_store(struct kobject *kobj,
 	spin_unlock(&fs_info->super_lock);
 
 	ret = btrfs_commit_transaction(trans, fs_info->fs_root);
-	if (ret)
-		return ret;
 
-	return count;
+out:
+	mnt_drop_write(vfsmount);
+	mntput(vfsmount);
+	if (!ret)
+		return count;
+	return ret;
 }
 
 static umode_t btrfs_feature_visible(struct kobject *kobj,
-- 
2.2.2


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
  2015-01-29  2:24 ` [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb Qu Wenruo
@ 2015-01-29 12:37   ` David Sterba
  2015-01-29 15:23     ` Al Viro
  2015-01-30  0:52   ` Miao Xie
  1 sibling, 1 reply; 40+ messages in thread
From: David Sterba @ 2015-01-29 12:37 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, linux-fsdevel, viro

Adding Al Viro into CC

On Thu, Jan 29, 2015 at 10:24:39AM +0800, Qu Wenruo wrote:
> +struct vfsmount *get_vfsmount_sb(struct super_block *sb)
> +{
> +	struct vfsmount *ret_vfs = NULL;
> +	struct mount *mnt;
> +	int ret = 0;
> +
> +	lock_mount_hash();
> +	if (list_empty(&sb->s_mounts))
> +		goto out;
> +	mnt = list_entry(sb->s_mounts.next, struct mount, mnt_instance);

from include/linux/fs.h:

struct super_block {
...
	struct list_head        s_mounts;       /* list of mounts; _not_ for fs use */
...
};

I hear a storm in the distance coming our direction ... so I'll
preemptively NAK this change.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
  2015-01-29 12:37   ` David Sterba
@ 2015-01-29 15:23     ` Al Viro
  2015-01-30  1:11         ` Qu Wenruo
  0 siblings, 1 reply; 40+ messages in thread
From: Al Viro @ 2015-01-29 15:23 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs, linux-fsdevel

On Thu, Jan 29, 2015 at 01:37:58PM +0100, David Sterba wrote:
> Adding Al Viro into CC
> 
> On Thu, Jan 29, 2015 at 10:24:39AM +0800, Qu Wenruo wrote:
> > +struct vfsmount *get_vfsmount_sb(struct super_block *sb)
> > +{
> > +	struct vfsmount *ret_vfs = NULL;
> > +	struct mount *mnt;
> > +	int ret = 0;
> > +
> > +	lock_mount_hash();
> > +	if (list_empty(&sb->s_mounts))
> > +		goto out;
> > +	mnt = list_entry(sb->s_mounts.next, struct mount, mnt_instance);
> 
> from include/linux/fs.h:
> 
> struct super_block {
> ...
> 	struct list_head        s_mounts;       /* list of mounts; _not_ for fs use */
> ...
> };
> 
> I hear a storm in the distance coming our direction ... so I'll
> preemptively NAK this change.

Could you explain what the devil is that for?  The primitive looks rather
bogus - if nothing else, it includes "make random instance of the filesystem
in someone's namespace appear busy to umount", which doesn't look like a
part of useful interface...  The only piece of context I'd been able to find
was something vague about sysfs-inflicted operations and wanting to use
mnt_want_write() but having nothing to pass it; BTW, what if the (random)
instance you run into happens to mounted r/o?

Assuming that your superblock is guaranteed to stay alive and usable for
whatever work you are trying to do, what's wrong with sb_want_write()?  

If it's _not_ guaranteed to stay so, and this is what you are trying to
solve, you are doing that at the wrong level - just take sysfs entry
removals earlier in shutdown process and be done with that.  Beginning of
close_ctree() would probably be early enough to be safe, but if that's
not enough, you can take it into the beginning of btrfs_kill_super().

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount option in a atomic way
  2015-01-29  2:24 ` [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount option in a atomic way Qu Wenruo
@ 2015-01-30  0:31   ` Miao Xie
  2015-01-30  1:20     ` Qu Wenruo
  0 siblings, 1 reply; 40+ messages in thread
From: Miao Xie @ 2015-01-30  0:31 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On Thu, 29 Jan 2015 10:24:35 +0800, Qu Wenruo wrote:
> Current btrfs_parse_options() is not atomic, which can set and clear a
> bit, especially for nospace_cache case.
> 
> For example, if a fs is mounted with nospace_cache,
> btrfs_parse_options() will set SPACE_CACHE bit first(since
> cache_generation is non-zeo) and clear the SPACE_CACHE bit due to
> nospace_cache mount option.
> So under heavy operations and remount a nospace_cache btrfs, there is a
> windows for commit to create space cache.
> 
> This bug can be reproduced by fstest/btrfs/071 073 074 with
> nospace_cache mount option. It has about 50% chance to create space
> cache, and about 10% chance to create wrong space cache, which can't
> pass btrfsck.
> 
> This patch will do the mount option parse in a copy-and-update method.
> First copy the mount_opt from fs_info to new_opt, and only update
> options in new_opt. At last, copy the new_opt back to
> fs_info->mount_opt.
> 
> This patch is already good enough to fix the above nospace_cache +
> remount bug, but need later patch to make sure mount options does not
> change during transaction.
> 
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
>  fs/btrfs/ctree.h |  16 ++++----
>  fs/btrfs/super.c | 115 +++++++++++++++++++++++++++++--------------------------
>  2 files changed, 69 insertions(+), 62 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 5f99743..26bb47b 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2119,18 +2119,18 @@ struct btrfs_ioctl_defrag_range_args {
>  #define btrfs_test_opt(root, opt)	((root)->fs_info->mount_opt & \
>  					 BTRFS_MOUNT_##opt)
>  
> -#define btrfs_set_and_info(root, opt, fmt, args...)			\
> +#define btrfs_set_and_info(fs_info, val, opt, fmt, args...)		\
>  {									\
> -	if (!btrfs_test_opt(root, opt))					\
> -		btrfs_info(root->fs_info, fmt, ##args);			\
> -	btrfs_set_opt(root->fs_info->mount_opt, opt);			\
> +	if (!btrfs_raw_test_opt(val, opt))				\
> +		btrfs_info(fs_info, fmt, ##args);			\
> +	btrfs_set_opt(val, opt);					\
>  }
>  
> -#define btrfs_clear_and_info(root, opt, fmt, args...)			\
> +#define btrfs_clear_and_info(fs_info, val, opt, fmt, args...)		\
>  {									\
> -	if (btrfs_test_opt(root, opt))					\
> -		btrfs_info(root->fs_info, fmt, ##args);			\
> -	btrfs_clear_opt(root->fs_info->mount_opt, opt);			\
> +	if (btrfs_raw_test_opt(val, opt))				\
> +		btrfs_info(fs_info, fmt, ##args);			\
> +	btrfs_clear_opt(val, opt);					\
>  }
>  
>  /*
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index b0c45b2..490fe1f 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -395,10 +395,13 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
>  	int ret = 0;
>  	char *compress_type;
>  	bool compress_force = false;
> +	unsigned long new_opt;
> +
> +	new_opt = info->mount_opt;

Here and

>  
>  	cache_gen = btrfs_super_cache_generation(root->fs_info->super_copy);
>  	if (cache_gen)
> -		btrfs_set_opt(info->mount_opt, SPACE_CACHE);
[SNIP]
>  out:
> -	if (!ret && btrfs_test_opt(root, SPACE_CACHE))
> -		btrfs_info(root->fs_info, "disk space caching is enabled");
> +	if (!ret) {
> +		if (btrfs_raw_test_opt(new_opt, SPACE_CACHE))
> +			btrfs_info(info, "disk space caching is enabled");
> +		info->mount_opt = new_opt;

Here need ACCESS_ONCE to wrap info->mount_opt, or the complier might use
info->mount_opt instead of new_opt.

But I worried that this is not key reason of the wrong space cache. Could
you explain the race condition which caused the wrong space cache?

Thanks
Miao

> +	}
>  	kfree(orig);
>  	return ret;
>  }
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
  2015-01-29  2:24 ` [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb Qu Wenruo
  2015-01-29 12:37   ` David Sterba
@ 2015-01-30  0:52   ` Miao Xie
  2015-01-30  1:44       ` Qu Wenruo
  1 sibling, 1 reply; 40+ messages in thread
From: Miao Xie @ 2015-01-30  0:52 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: linux-fsdevel

On Thu, 29 Jan 2015 10:24:39 +0800, Qu Wenruo wrote:
> There are sysfs interfaces in some fs, only btrfs yet, which will modify
> on-disk data.
> Unlike normal file operation routine we can use mnt_want_write_file() to
> protect the operation, change through sysfs won't to be binded to any file
> in the filesystem.
> So we can only extract the first vfsmount of a superblock and pass it to
> mnt_want_write() to do the protection.

This method is wrong, becasue one fs  may be mounted on the multi places
at the same time, someone is R/O, someone is R/W, you may get a R/O and
fail to get the write permission.

I think you do label/feature change by sysfs interface by the following way

btrfs_sysfs_change_XXXX()
{
	/* Use trylock to avoid the race with umount */
	if(!mutex_trylock(&sb->s_umount))
		return -EBUSY;

	check R/O and FREEZE

	mutex_unlock(&sb->s_umount);
}

Thanks
Miao

> 
> Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
>  fs/namespace.c        | 25 +++++++++++++++++++++++++
>  include/linux/mount.h |  1 +
>  2 files changed, 26 insertions(+)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index cd1e968..5a16a62 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1105,6 +1105,31 @@ struct vfsmount *mntget(struct vfsmount *mnt)
>  }
>  EXPORT_SYMBOL(mntget);
>  
> +/*
> + * get a vfsmount from a given sb
> + *
> + * This is especially used for case where change fs' sysfs interface
> + * will lead to a write, e.g. Change label through sysfs in btrfs.
> + * So vfs can get a vfsmount and then use mnt_want_write() to protect.
> + */
> +struct vfsmount *get_vfsmount_sb(struct super_block *sb)
> +{
> +	struct vfsmount *ret_vfs = NULL;
> +	struct mount *mnt;
> +	int ret = 0;
> +
> +	lock_mount_hash();
> +	if (list_empty(&sb->s_mounts))
> +		goto out;
> +	mnt = list_entry(sb->s_mounts.next, struct mount, mnt_instance);
> +	ret_vfs = &mnt->mnt;
> +	ret_vfs = mntget(ret_vfs);
> +out:
> +	unlock_mount_hash();
> +	return ret_vfs;
> +}
> +EXPORT_SYMBOL(get_vfsmount_sb);
> +
>  struct vfsmount *mnt_clone_internal(struct path *path)
>  {
>  	struct mount *p;
> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index c2c561d..cf1b0f5 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -79,6 +79,7 @@ extern void mnt_drop_write_file(struct file *file);
>  extern void mntput(struct vfsmount *mnt);
>  extern struct vfsmount *mntget(struct vfsmount *mnt);
>  extern struct vfsmount *mnt_clone_internal(struct path *path);
> +extern struct vfsmount *get_vfsmount_sb(struct super_block *sb);
>  extern int __mnt_is_readonly(struct vfsmount *mnt);
>  
>  struct path;
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
  2015-01-29 15:23     ` Al Viro
@ 2015-01-30  1:11         ` Qu Wenruo
  0 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2015-01-30  1:11 UTC (permalink / raw)
  To: Al Viro, dsterba, linux-btrfs, linux-fsdevel


-------- Original Message --------
Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function 
to get vfsmount from a given sb.
From: Al Viro <viro@ZenIV.linux.org.uk>
To: <dsterba@suse.cz>, Qu Wenruo <quwenruo@cn.fujitsu.com>, 
<linux-btrfs@vger.kernel.org>, linux-fsdevel <linux-fsdevel@vger.kernel.org>
Date: 2015年01月29日 23:23
> On Thu, Jan 29, 2015 at 01:37:58PM +0100, David Sterba wrote:
>> Adding Al Viro into CC
>>
>> On Thu, Jan 29, 2015 at 10:24:39AM +0800, Qu Wenruo wrote:
>>> +struct vfsmount *get_vfsmount_sb(struct super_block *sb)
>>> +{
>>> +	struct vfsmount *ret_vfs = NULL;
>>> +	struct mount *mnt;
>>> +	int ret = 0;
>>> +
>>> +	lock_mount_hash();
>>> +	if (list_empty(&sb->s_mounts))
>>> +		goto out;
>>> +	mnt = list_entry(sb->s_mounts.next, struct mount, mnt_instance);
>> from include/linux/fs.h:
>>
>> struct super_block {
>> ...
>> 	struct list_head        s_mounts;       /* list of mounts; _not_ for fs use */
>> ...
>> };
>>
>> I hear a storm in the distance coming our direction ... so I'll
>> preemptively NAK this change.
> Could you explain what the devil is that for?
In sysfs interface handler, we don't have struct file or vfsmount to do 
the mnt_want_write* protection,
so this function is introduced to get a vfsmount and do the protection.
> The primitive looks rather
> bogus - if nothing else, it includes "make random instance of the filesystem
> in someone's namespace appear busy to umount", which doesn't look like a
> part of useful interface...
This is the problem I didn't find a good way to avoid.

If using the function, it will always use the first(or last?) vfsmount 
of the filesystem.
For 1 to 1 match case, it's OK, but if one device is mounted on multiple 
mount points,
it will delay the umount of that mount point. But we only need to keep 
at least one rw mount point exists.
>    The only piece of context I'd been able to find
> was something vague about sysfs-inflicted operations and wanting to use
> mnt_want_write() but having nothing to pass it; BTW, what if the (random)
> instance you run into happens to mounted r/o?
For the mounted ro case, that's not a problem, since if one instance is 
mounted ro,
other instances are also mounted ro, so that's not a problem.

>
> Assuming that your superblock is guaranteed to stay alive and usable for
> whatever work you are trying to do, what's wrong with sb_want_write()?
Did you mean change the function name and it's parameter to 
sb_want_write(sb) and sb_drop_write(sb).
That looks much better.
But I'm a little worried about just using sb_start_write() and 
s_readonly_remount/s_flags to do the protection,
will it be enough?

Thanks,
Qu
>   
>
> If it's _not_ guaranteed to stay so, and this is what you are trying to
> solve, you are doing that at the wrong level - just take sysfs entry
> removals earlier in shutdown process and be done with that.  Beginning of
> close_ctree() would probably be early enough to be safe, but if that's
> not enough, you can take it into the beginning of btrfs_kill_super().


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
@ 2015-01-30  1:11         ` Qu Wenruo
  0 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2015-01-30  1:11 UTC (permalink / raw)
  To: Al Viro, dsterba, linux-btrfs, linux-fsdevel


-------- Original Message --------
Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function 
to get vfsmount from a given sb.
From: Al Viro <viro@ZenIV.linux.org.uk>
To: <dsterba@suse.cz>, Qu Wenruo <quwenruo@cn.fujitsu.com>, 
<linux-btrfs@vger.kernel.org>, linux-fsdevel <linux-fsdevel@vger.kernel.org>
Date: 2015年01月29日 23:23
> On Thu, Jan 29, 2015 at 01:37:58PM +0100, David Sterba wrote:
>> Adding Al Viro into CC
>>
>> On Thu, Jan 29, 2015 at 10:24:39AM +0800, Qu Wenruo wrote:
>>> +struct vfsmount *get_vfsmount_sb(struct super_block *sb)
>>> +{
>>> +	struct vfsmount *ret_vfs = NULL;
>>> +	struct mount *mnt;
>>> +	int ret = 0;
>>> +
>>> +	lock_mount_hash();
>>> +	if (list_empty(&sb->s_mounts))
>>> +		goto out;
>>> +	mnt = list_entry(sb->s_mounts.next, struct mount, mnt_instance);
>> from include/linux/fs.h:
>>
>> struct super_block {
>> ...
>> 	struct list_head        s_mounts;       /* list of mounts; _not_ for fs use */
>> ...
>> };
>>
>> I hear a storm in the distance coming our direction ... so I'll
>> preemptively NAK this change.
> Could you explain what the devil is that for?
In sysfs interface handler, we don't have struct file or vfsmount to do 
the mnt_want_write* protection,
so this function is introduced to get a vfsmount and do the protection.
> The primitive looks rather
> bogus - if nothing else, it includes "make random instance of the filesystem
> in someone's namespace appear busy to umount", which doesn't look like a
> part of useful interface...
This is the problem I didn't find a good way to avoid.

If using the function, it will always use the first(or last?) vfsmount 
of the filesystem.
For 1 to 1 match case, it's OK, but if one device is mounted on multiple 
mount points,
it will delay the umount of that mount point. But we only need to keep 
at least one rw mount point exists.
>    The only piece of context I'd been able to find
> was something vague about sysfs-inflicted operations and wanting to use
> mnt_want_write() but having nothing to pass it; BTW, what if the (random)
> instance you run into happens to mounted r/o?
For the mounted ro case, that's not a problem, since if one instance is 
mounted ro,
other instances are also mounted ro, so that's not a problem.

>
> Assuming that your superblock is guaranteed to stay alive and usable for
> whatever work you are trying to do, what's wrong with sb_want_write()?
Did you mean change the function name and it's parameter to 
sb_want_write(sb) and sb_drop_write(sb).
That looks much better.
But I'm a little worried about just using sb_start_write() and 
s_readonly_remount/s_flags to do the protection,
will it be enough?

Thanks,
Qu
>   
>
> If it's _not_ guaranteed to stay so, and this is what you are trying to
> solve, you are doing that at the wrong level - just take sysfs entry
> removals earlier in shutdown process and be done with that.  Beginning of
> close_ctree() would probably be early enough to be safe, but if that's
> not enough, you can take it into the beginning of btrfs_kill_super().

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount option in a atomic way
  2015-01-30  0:31   ` Miao Xie
@ 2015-01-30  1:20     ` Qu Wenruo
  2015-01-30  1:29       ` Miao Xie
  0 siblings, 1 reply; 40+ messages in thread
From: Qu Wenruo @ 2015-01-30  1:20 UTC (permalink / raw)
  To: Miao Xie, linux-btrfs


-------- Original Message --------
Subject: Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() 
parse mount option in a atomic way
From: Miao Xie <miaoxie@huawei.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>, <linux-btrfs@vger.kernel.org>
Date: 2015年01月30日 08:31
> On Thu, 29 Jan 2015 10:24:35 +0800, Qu Wenruo wrote:
>> Current btrfs_parse_options() is not atomic, which can set and clear a
>> bit, especially for nospace_cache case.
>>
>> For example, if a fs is mounted with nospace_cache,
>> btrfs_parse_options() will set SPACE_CACHE bit first(since
>> cache_generation is non-zeo) and clear the SPACE_CACHE bit due to
>> nospace_cache mount option.
>> So under heavy operations and remount a nospace_cache btrfs, there is a
>> windows for commit to create space cache.
>>
>> This bug can be reproduced by fstest/btrfs/071 073 074 with
>> nospace_cache mount option. It has about 50% chance to create space
>> cache, and about 10% chance to create wrong space cache, which can't
>> pass btrfsck.
>>
>> This patch will do the mount option parse in a copy-and-update method.
>> First copy the mount_opt from fs_info to new_opt, and only update
>> options in new_opt. At last, copy the new_opt back to
>> fs_info->mount_opt.
>>
>> This patch is already good enough to fix the above nospace_cache +
>> remount bug, but need later patch to make sure mount options does not
>> change during transaction.
>>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> ---
>>   fs/btrfs/ctree.h |  16 ++++----
>>   fs/btrfs/super.c | 115 +++++++++++++++++++++++++++++--------------------------
>>   2 files changed, 69 insertions(+), 62 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 5f99743..26bb47b 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -2119,18 +2119,18 @@ struct btrfs_ioctl_defrag_range_args {
>>   #define btrfs_test_opt(root, opt)	((root)->fs_info->mount_opt & \
>>   					 BTRFS_MOUNT_##opt)
>>   
>> -#define btrfs_set_and_info(root, opt, fmt, args...)			\
>> +#define btrfs_set_and_info(fs_info, val, opt, fmt, args...)		\
>>   {									\
>> -	if (!btrfs_test_opt(root, opt))					\
>> -		btrfs_info(root->fs_info, fmt, ##args);			\
>> -	btrfs_set_opt(root->fs_info->mount_opt, opt);			\
>> +	if (!btrfs_raw_test_opt(val, opt))				\
>> +		btrfs_info(fs_info, fmt, ##args);			\
>> +	btrfs_set_opt(val, opt);					\
>>   }
>>   
>> -#define btrfs_clear_and_info(root, opt, fmt, args...)			\
>> +#define btrfs_clear_and_info(fs_info, val, opt, fmt, args...)		\
>>   {									\
>> -	if (btrfs_test_opt(root, opt))					\
>> -		btrfs_info(root->fs_info, fmt, ##args);			\
>> -	btrfs_clear_opt(root->fs_info->mount_opt, opt);			\
>> +	if (btrfs_raw_test_opt(val, opt))				\
>> +		btrfs_info(fs_info, fmt, ##args);			\
>> +	btrfs_clear_opt(val, opt);					\
>>   }
>>   
>>   /*
>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>> index b0c45b2..490fe1f 100644
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -395,10 +395,13 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
>>   	int ret = 0;
>>   	char *compress_type;
>>   	bool compress_force = false;
>> +	unsigned long new_opt;
>> +
>> +	new_opt = info->mount_opt;
> Here and
>
>>   
>>   	cache_gen = btrfs_super_cache_generation(root->fs_info->super_copy);
>>   	if (cache_gen)
>> -		btrfs_set_opt(info->mount_opt, SPACE_CACHE);
> [SNIP]
>>   out:
>> -	if (!ret && btrfs_test_opt(root, SPACE_CACHE))
>> -		btrfs_info(root->fs_info, "disk space caching is enabled");
>> +	if (!ret) {
>> +		if (btrfs_raw_test_opt(new_opt, SPACE_CACHE))
>> +			btrfs_info(info, "disk space caching is enabled");
>> +		info->mount_opt = new_opt;
> Here need ACCESS_ONCE to wrap info->mount_opt, or the complier might use
> info->mount_opt instead of new_opt.
Thanks for pointing out this one.
>
> But I worried that this is not key reason of the wrong space cache. Could
> you explain the race condition which caused the wrong space cache?
>
> Thanks
> Miao
CPU0:
remount()
|- sync_fs() <- after sync_fs() we can start new trans
|- btrfs_parse_options()     CPU1:
                     |- start_transaction()
                     |- Do some bg allocation, not recorded in space_cache.
      |- set SPACE_CACHE bit due to cache_gen

                     |- commit_transaction()
                         |- write space cache and update cache_gen.
                             but since some of it is not recorded in 
space cache,
                             the space cache missing some records.
      |- clear SPACE_CACHE bit dut to nospace_cache

So the space cache is wrong.

Thanks,
Qu
>
>> +	}
>>   	kfree(orig);
>>   	return ret;
>>   }
>>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount option in a atomic way
  2015-01-30  1:20     ` Qu Wenruo
@ 2015-01-30  1:29       ` Miao Xie
  2015-01-30  1:33         ` Qu Wenruo
  0 siblings, 1 reply; 40+ messages in thread
From: Miao Xie @ 2015-01-30  1:29 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On Fri, 30 Jan 2015 09:20:46 +0800, Qu Wenruo wrote:
>> Here need ACCESS_ONCE to wrap info->mount_opt, or the complier might use
>> info->mount_opt instead of new_opt.
> Thanks for pointing out this one.
>>
>> But I worried that this is not key reason of the wrong space cache. Could
>> you explain the race condition which caused the wrong space cache?
>>
>> Thanks
>> Miao
> CPU0:
> remount()
> |- sync_fs() <- after sync_fs() we can start new trans
> |- btrfs_parse_options()     CPU1:
>                     |- start_transaction()
>                     |- Do some bg allocation, not recorded in space_cache.

I think it is a bug if a free space is not recorded in space cache. Could you
explain why it is not recorded?

Thanks
Miao

>      |- set SPACE_CACHE bit due to cache_gen
> 
>                     |- commit_transaction()
>                         |- write space cache and update cache_gen.
>                             but since some of it is not recorded in space cache,
>                             the space cache missing some records.
>      |- clear SPACE_CACHE bit dut to nospace_cache
> 
> So the space cache is wrong.
> 
> Thanks,
> Qu
>>
>>> +    }
>>>       kfree(orig);
>>>       return ret;
>>>   }
>>>
> 
> .
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount option in a atomic way
  2015-01-30  1:29       ` Miao Xie
@ 2015-01-30  1:33         ` Qu Wenruo
  2015-01-30  2:06           ` Miao Xie
  0 siblings, 1 reply; 40+ messages in thread
From: Qu Wenruo @ 2015-01-30  1:33 UTC (permalink / raw)
  To: Miao Xie, linux-btrfs


-------- Original Message --------
Subject: Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() 
parse mount option in a atomic way
From: Miao Xie <miaoxie@huawei.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>, <linux-btrfs@vger.kernel.org>
Date: 2015年01月30日 09:29
> On Fri, 30 Jan 2015 09:20:46 +0800, Qu Wenruo wrote:
>>> Here need ACCESS_ONCE to wrap info->mount_opt, or the complier might use
>>> info->mount_opt instead of new_opt.
>> Thanks for pointing out this one.
>>> But I worried that this is not key reason of the wrong space cache. Could
>>> you explain the race condition which caused the wrong space cache?
>>>
>>> Thanks
>>> Miao
>> CPU0:
>> remount()
>> |- sync_fs() <- after sync_fs() we can start new trans
>> |- btrfs_parse_options()     CPU1:
>>                      |- start_transaction()
>>                      |- Do some bg allocation, not recorded in space_cache.
> I think it is a bug if a free space is not recorded in space cache. Could you
> explain why it is not recorded?
>
> Thanks
> Miao
IIRC, in that window, the fs_info->mount_opt's SPACE_CACHE bit is cleared.
So space cache is not recorded.

Thanks,
Qu
>
>>       |- set SPACE_CACHE bit due to cache_gen
>>
>>                      |- commit_transaction()
>>                          |- write space cache and update cache_gen.
>>                              but since some of it is not recorded in space cache,
>>                              the space cache missing some records.
>>       |- clear SPACE_CACHE bit dut to nospace_cache
>>
>> So the space cache is wrong.
>>
>> Thanks,
>> Qu
>>>> +    }
>>>>        kfree(orig);
>>>>        return ret;
>>>>    }
>>>>
>> .
>>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
  2015-01-30  0:52   ` Miao Xie
@ 2015-01-30  1:44       ` Qu Wenruo
  0 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2015-01-30  1:44 UTC (permalink / raw)
  To: Miao Xie, linux-btrfs; +Cc: linux-fsdevel


-------- Original Message --------
Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function 
to get vfsmount from a given sb.
From: Miao Xie <miaoxie@huawei.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>, <linux-btrfs@vger.kernel.org>
Date: 2015年01月30日 08:52
> On Thu, 29 Jan 2015 10:24:39 +0800, Qu Wenruo wrote:
>> There are sysfs interfaces in some fs, only btrfs yet, which will modify
>> on-disk data.
>> Unlike normal file operation routine we can use mnt_want_write_file() to
>> protect the operation, change through sysfs won't to be binded to any file
>> in the filesystem.
>> So we can only extract the first vfsmount of a superblock and pass it to
>> mnt_want_write() to do the protection.
> This method is wrong, becasue one fs  may be mounted on the multi places
> at the same time, someone is R/O, someone is R/W, you may get a R/O and
> fail to get the write permission.
This shouldn't happen. If someone is ro, the whole fs should be ro, right?
You can mount a device which is already mounted as rw to other point as ro,
and remount a mount point to ro will also cause all other mount point to ro.

So I didn't see the problem here.
>
> I think you do label/feature change by sysfs interface by the following way
>
> btrfs_sysfs_change_XXXX()
> {
> 	/* Use trylock to avoid the race with umount */
> 	if(!mutex_trylock(&sb->s_umount))
> 		return -EBUSY;
>
> 	check R/O and FREEZE
>
> 	mutex_unlock(&sb->s_umount);
> }
This looks better since it not introduce changes to VFS.

Thanks,
Qu
>
> Thanks
> Miao
>
>> Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> ---
>>   fs/namespace.c        | 25 +++++++++++++++++++++++++
>>   include/linux/mount.h |  1 +
>>   2 files changed, 26 insertions(+)
>>
>> diff --git a/fs/namespace.c b/fs/namespace.c
>> index cd1e968..5a16a62 100644
>> --- a/fs/namespace.c
>> +++ b/fs/namespace.c
>> @@ -1105,6 +1105,31 @@ struct vfsmount *mntget(struct vfsmount *mnt)
>>   }
>>   EXPORT_SYMBOL(mntget);
>>   
>> +/*
>> + * get a vfsmount from a given sb
>> + *
>> + * This is especially used for case where change fs' sysfs interface
>> + * will lead to a write, e.g. Change label through sysfs in btrfs.
>> + * So vfs can get a vfsmount and then use mnt_want_write() to protect.
>> + */
>> +struct vfsmount *get_vfsmount_sb(struct super_block *sb)
>> +{
>> +	struct vfsmount *ret_vfs = NULL;
>> +	struct mount *mnt;
>> +	int ret = 0;
>> +
>> +	lock_mount_hash();
>> +	if (list_empty(&sb->s_mounts))
>> +		goto out;
>> +	mnt = list_entry(sb->s_mounts.next, struct mount, mnt_instance);
>> +	ret_vfs = &mnt->mnt;
>> +	ret_vfs = mntget(ret_vfs);
>> +out:
>> +	unlock_mount_hash();
>> +	return ret_vfs;
>> +}
>> +EXPORT_SYMBOL(get_vfsmount_sb);
>> +
>>   struct vfsmount *mnt_clone_internal(struct path *path)
>>   {
>>   	struct mount *p;
>> diff --git a/include/linux/mount.h b/include/linux/mount.h
>> index c2c561d..cf1b0f5 100644
>> --- a/include/linux/mount.h
>> +++ b/include/linux/mount.h
>> @@ -79,6 +79,7 @@ extern void mnt_drop_write_file(struct file *file);
>>   extern void mntput(struct vfsmount *mnt);
>>   extern struct vfsmount *mntget(struct vfsmount *mnt);
>>   extern struct vfsmount *mnt_clone_internal(struct path *path);
>> +extern struct vfsmount *get_vfsmount_sb(struct super_block *sb);
>>   extern int __mnt_is_readonly(struct vfsmount *mnt);
>>   
>>   struct path;
>>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
@ 2015-01-30  1:44       ` Qu Wenruo
  0 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2015-01-30  1:44 UTC (permalink / raw)
  To: Miao Xie, linux-btrfs; +Cc: linux-fsdevel


-------- Original Message --------
Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function 
to get vfsmount from a given sb.
From: Miao Xie <miaoxie@huawei.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>, <linux-btrfs@vger.kernel.org>
Date: 2015年01月30日 08:52
> On Thu, 29 Jan 2015 10:24:39 +0800, Qu Wenruo wrote:
>> There are sysfs interfaces in some fs, only btrfs yet, which will modify
>> on-disk data.
>> Unlike normal file operation routine we can use mnt_want_write_file() to
>> protect the operation, change through sysfs won't to be binded to any file
>> in the filesystem.
>> So we can only extract the first vfsmount of a superblock and pass it to
>> mnt_want_write() to do the protection.
> This method is wrong, becasue one fs  may be mounted on the multi places
> at the same time, someone is R/O, someone is R/W, you may get a R/O and
> fail to get the write permission.
This shouldn't happen. If someone is ro, the whole fs should be ro, right?
You can mount a device which is already mounted as rw to other point as ro,
and remount a mount point to ro will also cause all other mount point to ro.

So I didn't see the problem here.
>
> I think you do label/feature change by sysfs interface by the following way
>
> btrfs_sysfs_change_XXXX()
> {
> 	/* Use trylock to avoid the race with umount */
> 	if(!mutex_trylock(&sb->s_umount))
> 		return -EBUSY;
>
> 	check R/O and FREEZE
>
> 	mutex_unlock(&sb->s_umount);
> }
This looks better since it not introduce changes to VFS.

Thanks,
Qu
>
> Thanks
> Miao
>
>> Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> ---
>>   fs/namespace.c        | 25 +++++++++++++++++++++++++
>>   include/linux/mount.h |  1 +
>>   2 files changed, 26 insertions(+)
>>
>> diff --git a/fs/namespace.c b/fs/namespace.c
>> index cd1e968..5a16a62 100644
>> --- a/fs/namespace.c
>> +++ b/fs/namespace.c
>> @@ -1105,6 +1105,31 @@ struct vfsmount *mntget(struct vfsmount *mnt)
>>   }
>>   EXPORT_SYMBOL(mntget);
>>   
>> +/*
>> + * get a vfsmount from a given sb
>> + *
>> + * This is especially used for case where change fs' sysfs interface
>> + * will lead to a write, e.g. Change label through sysfs in btrfs.
>> + * So vfs can get a vfsmount and then use mnt_want_write() to protect.
>> + */
>> +struct vfsmount *get_vfsmount_sb(struct super_block *sb)
>> +{
>> +	struct vfsmount *ret_vfs = NULL;
>> +	struct mount *mnt;
>> +	int ret = 0;
>> +
>> +	lock_mount_hash();
>> +	if (list_empty(&sb->s_mounts))
>> +		goto out;
>> +	mnt = list_entry(sb->s_mounts.next, struct mount, mnt_instance);
>> +	ret_vfs = &mnt->mnt;
>> +	ret_vfs = mntget(ret_vfs);
>> +out:
>> +	unlock_mount_hash();
>> +	return ret_vfs;
>> +}
>> +EXPORT_SYMBOL(get_vfsmount_sb);
>> +
>>   struct vfsmount *mnt_clone_internal(struct path *path)
>>   {
>>   	struct mount *p;
>> diff --git a/include/linux/mount.h b/include/linux/mount.h
>> index c2c561d..cf1b0f5 100644
>> --- a/include/linux/mount.h
>> +++ b/include/linux/mount.h
>> @@ -79,6 +79,7 @@ extern void mnt_drop_write_file(struct file *file);
>>   extern void mntput(struct vfsmount *mnt);
>>   extern struct vfsmount *mntget(struct vfsmount *mnt);
>>   extern struct vfsmount *mnt_clone_internal(struct path *path);
>> +extern struct vfsmount *get_vfsmount_sb(struct super_block *sb);
>>   extern int __mnt_is_readonly(struct vfsmount *mnt);
>>   
>>   struct path;
>>

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
  2015-01-30  1:44       ` Qu Wenruo
@ 2015-01-30  2:02         ` Qu Wenruo
  -1 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2015-01-30  2:02 UTC (permalink / raw)
  To: Miao Xie, linux-btrfs; +Cc: linux-fsdevel


-------- Original Message --------
Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function 
to get vfsmount from a given sb.
From: Qu Wenruo <quwenruo@cn.fujitsu.com>
To: Miao Xie <miaoxie@huawei.com>, linux-btrfs@vger.kernel.org
Date: 2015年01月30日 09:44
>
> -------- Original Message --------
> Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function 
> to get vfsmount from a given sb.
> From: Miao Xie <miaoxie@huawei.com>
> To: Qu Wenruo <quwenruo@cn.fujitsu.com>, <linux-btrfs@vger.kernel.org>
> Date: 2015年01月30日 08:52
>> On Thu, 29 Jan 2015 10:24:39 +0800, Qu Wenruo wrote:
>>> There are sysfs interfaces in some fs, only btrfs yet, which will 
>>> modify
>>> on-disk data.
>>> Unlike normal file operation routine we can use 
>>> mnt_want_write_file() to
>>> protect the operation, change through sysfs won't to be binded to 
>>> any file
>>> in the filesystem.
>>> So we can only extract the first vfsmount of a superblock and pass 
>>> it to
>>> mnt_want_write() to do the protection.
>> This method is wrong, becasue one fs  may be mounted on the multi places
>> at the same time, someone is R/O, someone is R/W, you may get a R/O and
>> fail to get the write permission.
> This shouldn't happen. If someone is ro, the whole fs should be ro, 
> right?
> You can mount a device which is already mounted as rw to other point 
> as ro,
> and remount a mount point to ro will also cause all other mount point 
> to ro.
>
> So I didn't see the problem here.
>>
>> I think you do label/feature change by sysfs interface by the 
>> following way
>>
>> btrfs_sysfs_change_XXXX()
>> {
>>     /* Use trylock to avoid the race with umount */
>>     if(!mutex_trylock(&sb->s_umount))
>>         return -EBUSY;
>>
>>     check R/O and FREEZE
>>
>>     mutex_unlock(&sb->s_umount);
>> }
> This looks better since it not introduce changes to VFS.
>
> Thanks,
> Qu
Oh, wait a second, this one leads to the old problem and old solution.

If we hold s_umount mutex, we must do freeze check and can't start 
transaction since it will deadlock.

And for freeze check, we must use sb_try_start_intwrite() to hold the 
freeze lock and then add a new
btrfs_start_transaction_freeze() which will not call sb_start_write()...

Oh this seems so similar, v2 or v3 version RFC patch?
So still goes to the old method?

Thanks,
Qu
>>
>> Thanks
>> Miao
>>
>>> Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>
>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>> ---
>>>   fs/namespace.c        | 25 +++++++++++++++++++++++++
>>>   include/linux/mount.h |  1 +
>>>   2 files changed, 26 insertions(+)
>>>
>>> diff --git a/fs/namespace.c b/fs/namespace.c
>>> index cd1e968..5a16a62 100644
>>> --- a/fs/namespace.c
>>> +++ b/fs/namespace.c
>>> @@ -1105,6 +1105,31 @@ struct vfsmount *mntget(struct vfsmount *mnt)
>>>   }
>>>   EXPORT_SYMBOL(mntget);
>>>   +/*
>>> + * get a vfsmount from a given sb
>>> + *
>>> + * This is especially used for case where change fs' sysfs interface
>>> + * will lead to a write, e.g. Change label through sysfs in btrfs.
>>> + * So vfs can get a vfsmount and then use mnt_want_write() to protect.
>>> + */
>>> +struct vfsmount *get_vfsmount_sb(struct super_block *sb)
>>> +{
>>> +    struct vfsmount *ret_vfs = NULL;
>>> +    struct mount *mnt;
>>> +    int ret = 0;
>>> +
>>> +    lock_mount_hash();
>>> +    if (list_empty(&sb->s_mounts))
>>> +        goto out;
>>> +    mnt = list_entry(sb->s_mounts.next, struct mount, mnt_instance);
>>> +    ret_vfs = &mnt->mnt;
>>> +    ret_vfs = mntget(ret_vfs);
>>> +out:
>>> +    unlock_mount_hash();
>>> +    return ret_vfs;
>>> +}
>>> +EXPORT_SYMBOL(get_vfsmount_sb);
>>> +
>>>   struct vfsmount *mnt_clone_internal(struct path *path)
>>>   {
>>>       struct mount *p;
>>> diff --git a/include/linux/mount.h b/include/linux/mount.h
>>> index c2c561d..cf1b0f5 100644
>>> --- a/include/linux/mount.h
>>> +++ b/include/linux/mount.h
>>> @@ -79,6 +79,7 @@ extern void mnt_drop_write_file(struct file *file);
>>>   extern void mntput(struct vfsmount *mnt);
>>>   extern struct vfsmount *mntget(struct vfsmount *mnt);
>>>   extern struct vfsmount *mnt_clone_internal(struct path *path);
>>> +extern struct vfsmount *get_vfsmount_sb(struct super_block *sb);
>>>   extern int __mnt_is_readonly(struct vfsmount *mnt);
>>>     struct path;
>>>
>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
@ 2015-01-30  2:02         ` Qu Wenruo
  0 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2015-01-30  2:02 UTC (permalink / raw)
  To: Miao Xie, linux-btrfs; +Cc: linux-fsdevel


-------- Original Message --------
Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function 
to get vfsmount from a given sb.
From: Qu Wenruo <quwenruo@cn.fujitsu.com>
To: Miao Xie <miaoxie@huawei.com>, linux-btrfs@vger.kernel.org
Date: 2015年01月30日 09:44
>
> -------- Original Message --------
> Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function 
> to get vfsmount from a given sb.
> From: Miao Xie <miaoxie@huawei.com>
> To: Qu Wenruo <quwenruo@cn.fujitsu.com>, <linux-btrfs@vger.kernel.org>
> Date: 2015年01月30日 08:52
>> On Thu, 29 Jan 2015 10:24:39 +0800, Qu Wenruo wrote:
>>> There are sysfs interfaces in some fs, only btrfs yet, which will 
>>> modify
>>> on-disk data.
>>> Unlike normal file operation routine we can use 
>>> mnt_want_write_file() to
>>> protect the operation, change through sysfs won't to be binded to 
>>> any file
>>> in the filesystem.
>>> So we can only extract the first vfsmount of a superblock and pass 
>>> it to
>>> mnt_want_write() to do the protection.
>> This method is wrong, becasue one fs  may be mounted on the multi places
>> at the same time, someone is R/O, someone is R/W, you may get a R/O and
>> fail to get the write permission.
> This shouldn't happen. If someone is ro, the whole fs should be ro, 
> right?
> You can mount a device which is already mounted as rw to other point 
> as ro,
> and remount a mount point to ro will also cause all other mount point 
> to ro.
>
> So I didn't see the problem here.
>>
>> I think you do label/feature change by sysfs interface by the 
>> following way
>>
>> btrfs_sysfs_change_XXXX()
>> {
>>     /* Use trylock to avoid the race with umount */
>>     if(!mutex_trylock(&sb->s_umount))
>>         return -EBUSY;
>>
>>     check R/O and FREEZE
>>
>>     mutex_unlock(&sb->s_umount);
>> }
> This looks better since it not introduce changes to VFS.
>
> Thanks,
> Qu
Oh, wait a second, this one leads to the old problem and old solution.

If we hold s_umount mutex, we must do freeze check and can't start 
transaction since it will deadlock.

And for freeze check, we must use sb_try_start_intwrite() to hold the 
freeze lock and then add a new
btrfs_start_transaction_freeze() which will not call sb_start_write()...

Oh this seems so similar, v2 or v3 version RFC patch?
So still goes to the old method?

Thanks,
Qu
>>
>> Thanks
>> Miao
>>
>>> Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>
>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>> ---
>>>   fs/namespace.c        | 25 +++++++++++++++++++++++++
>>>   include/linux/mount.h |  1 +
>>>   2 files changed, 26 insertions(+)
>>>
>>> diff --git a/fs/namespace.c b/fs/namespace.c
>>> index cd1e968..5a16a62 100644
>>> --- a/fs/namespace.c
>>> +++ b/fs/namespace.c
>>> @@ -1105,6 +1105,31 @@ struct vfsmount *mntget(struct vfsmount *mnt)
>>>   }
>>>   EXPORT_SYMBOL(mntget);
>>>   +/*
>>> + * get a vfsmount from a given sb
>>> + *
>>> + * This is especially used for case where change fs' sysfs interface
>>> + * will lead to a write, e.g. Change label through sysfs in btrfs.
>>> + * So vfs can get a vfsmount and then use mnt_want_write() to protect.
>>> + */
>>> +struct vfsmount *get_vfsmount_sb(struct super_block *sb)
>>> +{
>>> +    struct vfsmount *ret_vfs = NULL;
>>> +    struct mount *mnt;
>>> +    int ret = 0;
>>> +
>>> +    lock_mount_hash();
>>> +    if (list_empty(&sb->s_mounts))
>>> +        goto out;
>>> +    mnt = list_entry(sb->s_mounts.next, struct mount, mnt_instance);
>>> +    ret_vfs = &mnt->mnt;
>>> +    ret_vfs = mntget(ret_vfs);
>>> +out:
>>> +    unlock_mount_hash();
>>> +    return ret_vfs;
>>> +}
>>> +EXPORT_SYMBOL(get_vfsmount_sb);
>>> +
>>>   struct vfsmount *mnt_clone_internal(struct path *path)
>>>   {
>>>       struct mount *p;
>>> diff --git a/include/linux/mount.h b/include/linux/mount.h
>>> index c2c561d..cf1b0f5 100644
>>> --- a/include/linux/mount.h
>>> +++ b/include/linux/mount.h
>>> @@ -79,6 +79,7 @@ extern void mnt_drop_write_file(struct file *file);
>>>   extern void mntput(struct vfsmount *mnt);
>>>   extern struct vfsmount *mntget(struct vfsmount *mnt);
>>>   extern struct vfsmount *mnt_clone_internal(struct path *path);
>>> +extern struct vfsmount *get_vfsmount_sb(struct super_block *sb);
>>>   extern int __mnt_is_readonly(struct vfsmount *mnt);
>>>     struct path;
>>>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount option in a atomic way
  2015-01-30  1:33         ` Qu Wenruo
@ 2015-01-30  2:06           ` Miao Xie
  2015-01-30  2:51             ` Qu Wenruo
  0 siblings, 1 reply; 40+ messages in thread
From: Miao Xie @ 2015-01-30  2:06 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On Fri, 30 Jan 2015 09:33:17 +0800, Qu Wenruo wrote:
> 
> -------- Original Message --------
> Subject: Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount
> option in a atomic way
> From: Miao Xie <miaoxie@huawei.com>
> To: Qu Wenruo <quwenruo@cn.fujitsu.com>, <linux-btrfs@vger.kernel.org>
> Date: 2015年01月30日 09:29
>> On Fri, 30 Jan 2015 09:20:46 +0800, Qu Wenruo wrote:
>>>> Here need ACCESS_ONCE to wrap info->mount_opt, or the complier might use
>>>> info->mount_opt instead of new_opt.
>>> Thanks for pointing out this one.
>>>> But I worried that this is not key reason of the wrong space cache. Could
>>>> you explain the race condition which caused the wrong space cache?
>>>>
>>>> Thanks
>>>> Miao
>>> CPU0:
>>> remount()
>>> |- sync_fs() <- after sync_fs() we can start new trans
>>> |- btrfs_parse_options()     CPU1:
>>>                      |- start_transaction()
>>>                      |- Do some bg allocation, not recorded in space_cache.
>> I think it is a bug if a free space is not recorded in space cache. Could you
>> explain why it is not recorded?
>>
>> Thanks
>> Miao
> IIRC, in that window, the fs_info->mount_opt's SPACE_CACHE bit is cleared.
> So space cache is not recorded.

SPACE_CACHE is used to control cache write out, not in-memory cache. All the
free space should be recorded in in-memory cache.And when we write out
the in-memory space cache, we need protect the space cache from changing.

Thanks
Miao

> 
> Thanks,
> Qu
>>
>>>       |- set SPACE_CACHE bit due to cache_gen
>>>
>>>                      |- commit_transaction()
>>>                          |- write space cache and update cache_gen.
>>>                              but since some of it is not recorded in space
>>> cache,
>>>                              the space cache missing some records.
>>>       |- clear SPACE_CACHE bit dut to nospace_cache
>>>
>>> So the space cache is wrong.
>>>
>>> Thanks,
>>> Qu
>>>>> +    }
>>>>>        kfree(orig);
>>>>>        return ret;
>>>>>    }
>>>>>
>>> .
>>>
> 
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
  2015-01-30  1:11         ` Qu Wenruo
  (?)
@ 2015-01-30  2:09         ` Al Viro
  2015-01-30  2:20             ` Qu Wenruo
  -1 siblings, 1 reply; 40+ messages in thread
From: Al Viro @ 2015-01-30  2:09 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, linux-btrfs, linux-fsdevel

On Fri, Jan 30, 2015 at 09:11:26AM +0800, Qu Wenruo wrote:
> For the mounted ro case, that's not a problem, since if one instance
> is mounted ro,
> other instances are also mounted ro, so that's not a problem.

Not really.

root@satch:~# cd /tmp/
root@satch:~# mkdir /tm/a
root@satch:~# mount --bind /tmp/ /tmp/a
root@satch:~# mount --bind -o remount,ro /tmp/a
root@satch:~# mkdir /tmp/b  
root@satch:~# mkdir /tmp/a/c
mkdir: cannot create directory '/tmp/a/c': Read-only file system
root@satch:~# stat /tmp/b /tmp/a/b
  File: '/tmp/b'
  Size: 4096            Blocks: 8          IO Block: 4096   directory
Device: 806h/2054d      Inode: 257537      Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2015-01-29 21:03:35.000000000 -0500
Modify: 2015-01-29 21:03:35.000000000 -0500
Change: 2015-01-29 21:03:35.000000000 -0500
 Birth: -
  File: '/tmp/a/b'
  Size: 4096            Blocks: 8          IO Block: 4096   directory
Device: 806h/2054d      Inode: 257537      Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2015-01-29 21:03:35.000000000 -0500
Modify: 2015-01-29 21:03:35.000000000 -0500
Change: 2015-01-29 21:03:35.000000000 -0500
 Birth: -
root@satch:~#

IOW, /tmp and /tmp/a bear the same filesystem (one from /dev/sda6), the former
is mounted r/w, the latter - r/o.

> >Assuming that your superblock is guaranteed to stay alive and usable for
> >whatever work you are trying to do, what's wrong with sb_want_write()?
> Did you mean change the function name and it's parameter to
> sb_want_write(sb) and sb_drop_write(sb).
> That looks much better.
> But I'm a little worried about just using sb_start_write() and
> s_readonly_remount/s_flags to do the protection,
> will it be enough?

Protection against what?  If superblock is r/w, it will stay r/w until you
do sb_drop_write(); if it's r/o, sb_want_write() will fail.  There might be
any number of vfsmounts over superblock; attempt to get write access via
vfsmount will succeed only of both the vfsmount and superblock are r/w -
mnt_want_write() does sb_want_write() and fails if that has failed.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
  2015-01-30  1:44       ` Qu Wenruo
  (?)
  (?)
@ 2015-01-30  2:14       ` Al Viro
  2015-01-30  4:14         ` Miao Xie
  -1 siblings, 1 reply; 40+ messages in thread
From: Al Viro @ 2015-01-30  2:14 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Miao Xie, linux-btrfs, linux-fsdevel

On Fri, Jan 30, 2015 at 09:44:03AM +0800, Qu Wenruo wrote:

> This shouldn't happen. If someone is ro, the whole fs should be ro, right?

Wrong.  Individual vfsmounts over an r/w superblock might very well be r/o.
As for that trylock...  What for?  It invites transient failures for no
good reason.  Removal of sysfs entry will block while write(2) to that sucker
is in progress, so btrfs shutdown will block at that point in ctree_close().
It won't go away under you.

Now, you might want to move those sysfs entry removals to the very beginning
of btrfs_kill_super(), but that's a different story - you need only to make
sure that they are removed not later than the destruction of the data
structures they need (IOW, the current location might very well be OK - I
hadn't checked the details).

As for "it won't go r/o under us" - sb_want_write() will do that just fine.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
  2015-01-30  2:09         ` Al Viro
@ 2015-01-30  2:20             ` Qu Wenruo
  0 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2015-01-30  2:20 UTC (permalink / raw)
  To: Al Viro; +Cc: dsterba, linux-btrfs, linux-fsdevel


-------- Original Message --------
Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function 
to get vfsmount from a given sb.
From: Al Viro <viro@ZenIV.linux.org.uk>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>
Date: 2015年01月30日 10:09
> On Fri, Jan 30, 2015 at 09:11:26AM +0800, Qu Wenruo wrote:
>> For the mounted ro case, that's not a problem, since if one instance
>> is mounted ro,
>> other instances are also mounted ro, so that's not a problem.
> Not really.
>
> root@satch:~# cd /tmp/
> root@satch:~# mkdir /tm/a
> root@satch:~# mount --bind /tmp/ /tmp/a
> root@satch:~# mount --bind -o remount,ro /tmp/a
> root@satch:~# mkdir /tmp/b
> root@satch:~# mkdir /tmp/a/c
> mkdir: cannot create directory '/tmp/a/c': Read-only file system
> root@satch:~# stat /tmp/b /tmp/a/b
>    File: '/tmp/b'
>    Size: 4096            Blocks: 8          IO Block: 4096   directory
> Device: 806h/2054d      Inode: 257537      Links: 2
> Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2015-01-29 21:03:35.000000000 -0500
> Modify: 2015-01-29 21:03:35.000000000 -0500
> Change: 2015-01-29 21:03:35.000000000 -0500
>   Birth: -
>    File: '/tmp/a/b'
>    Size: 4096            Blocks: 8          IO Block: 4096   directory
> Device: 806h/2054d      Inode: 257537      Links: 2
> Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2015-01-29 21:03:35.000000000 -0500
> Modify: 2015-01-29 21:03:35.000000000 -0500
> Change: 2015-01-29 21:03:35.000000000 -0500
>   Birth: -
> root@satch:~#
>
> IOW, /tmp and /tmp/a bear the same filesystem (one from /dev/sda6), the former
> is mounted r/w, the latter - r/o.
Oh, bind mount.....

I was so stupid to forget bind.
>
>>> Assuming that your superblock is guaranteed to stay alive and usable for
>>> whatever work you are trying to do, what's wrong with sb_want_write()?
>> Did you mean change the function name and it's parameter to
>> sb_want_write(sb) and sb_drop_write(sb).
>> That looks much better.
>> But I'm a little worried about just using sb_start_write() and
>> s_readonly_remount/s_flags to do the protection,
>> will it be enough?
> Protection against what?  If superblock is r/w, it will stay r/w until you
> do sb_drop_write(); if it's r/o, sb_want_write() will fail.  There might be
> any number of vfsmounts over superblock; attempt to get write access via
> vfsmount will succeed only of both the vfsmount and superblock are r/w -
> mnt_want_write() does sb_want_write() and fails if that has failed.
Nice.
I'll add sb_want_write() and sb_drop_write().

Tons of thanks for all your advice!
Qu

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
@ 2015-01-30  2:20             ` Qu Wenruo
  0 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2015-01-30  2:20 UTC (permalink / raw)
  To: Al Viro; +Cc: dsterba, linux-btrfs, linux-fsdevel


-------- Original Message --------
Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function 
to get vfsmount from a given sb.
From: Al Viro <viro@ZenIV.linux.org.uk>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>
Date: 2015年01月30日 10:09
> On Fri, Jan 30, 2015 at 09:11:26AM +0800, Qu Wenruo wrote:
>> For the mounted ro case, that's not a problem, since if one instance
>> is mounted ro,
>> other instances are also mounted ro, so that's not a problem.
> Not really.
>
> root@satch:~# cd /tmp/
> root@satch:~# mkdir /tm/a
> root@satch:~# mount --bind /tmp/ /tmp/a
> root@satch:~# mount --bind -o remount,ro /tmp/a
> root@satch:~# mkdir /tmp/b
> root@satch:~# mkdir /tmp/a/c
> mkdir: cannot create directory '/tmp/a/c': Read-only file system
> root@satch:~# stat /tmp/b /tmp/a/b
>    File: '/tmp/b'
>    Size: 4096            Blocks: 8          IO Block: 4096   directory
> Device: 806h/2054d      Inode: 257537      Links: 2
> Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2015-01-29 21:03:35.000000000 -0500
> Modify: 2015-01-29 21:03:35.000000000 -0500
> Change: 2015-01-29 21:03:35.000000000 -0500
>   Birth: -
>    File: '/tmp/a/b'
>    Size: 4096            Blocks: 8          IO Block: 4096   directory
> Device: 806h/2054d      Inode: 257537      Links: 2
> Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2015-01-29 21:03:35.000000000 -0500
> Modify: 2015-01-29 21:03:35.000000000 -0500
> Change: 2015-01-29 21:03:35.000000000 -0500
>   Birth: -
> root@satch:~#
>
> IOW, /tmp and /tmp/a bear the same filesystem (one from /dev/sda6), the former
> is mounted r/w, the latter - r/o.
Oh, bind mount.....

I was so stupid to forget bind.
>
>>> Assuming that your superblock is guaranteed to stay alive and usable for
>>> whatever work you are trying to do, what's wrong with sb_want_write()?
>> Did you mean change the function name and it's parameter to
>> sb_want_write(sb) and sb_drop_write(sb).
>> That looks much better.
>> But I'm a little worried about just using sb_start_write() and
>> s_readonly_remount/s_flags to do the protection,
>> will it be enough?
> Protection against what?  If superblock is r/w, it will stay r/w until you
> do sb_drop_write(); if it's r/o, sb_want_write() will fail.  There might be
> any number of vfsmounts over superblock; attempt to get write access via
> vfsmount will succeed only of both the vfsmount and superblock are r/w -
> mnt_want_write() does sb_want_write() and fails if that has failed.
Nice.
I'll add sb_want_write() and sb_drop_write().

Tons of thanks for all your advice!
Qu
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount option in a atomic way
  2015-01-30  2:06           ` Miao Xie
@ 2015-01-30  2:51             ` Qu Wenruo
  2015-01-30  3:21               ` Miao Xie
  0 siblings, 1 reply; 40+ messages in thread
From: Qu Wenruo @ 2015-01-30  2:51 UTC (permalink / raw)
  To: Miao Xie, linux-btrfs


-------- Original Message --------
Subject: Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() 
parse mount option in a atomic way
From: Miao Xie <miaoxie@huawei.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>, <linux-btrfs@vger.kernel.org>
Date: 2015年01月30日 10:06
> On Fri, 30 Jan 2015 09:33:17 +0800, Qu Wenruo wrote:
>> -------- Original Message --------
>> Subject: Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount
>> option in a atomic way
>> From: Miao Xie <miaoxie@huawei.com>
>> To: Qu Wenruo <quwenruo@cn.fujitsu.com>, <linux-btrfs@vger.kernel.org>
>> Date: 2015年01月30日 09:29
>>> On Fri, 30 Jan 2015 09:20:46 +0800, Qu Wenruo wrote:
>>>>> Here need ACCESS_ONCE to wrap info->mount_opt, or the complier might use
>>>>> info->mount_opt instead of new_opt.
>>>> Thanks for pointing out this one.
>>>>> But I worried that this is not key reason of the wrong space cache. Could
>>>>> you explain the race condition which caused the wrong space cache?
>>>>>
>>>>> Thanks
>>>>> Miao
>>>> CPU0:
>>>> remount()
>>>> |- sync_fs() <- after sync_fs() we can start new trans
>>>> |- btrfs_parse_options()     CPU1:
>>>>                       |- start_transaction()
>>>>                       |- Do some bg allocation, not recorded in space_cache.
>>> I think it is a bug if a free space is not recorded in space cache. Could you
>>> explain why it is not recorded?
>>>
>>> Thanks
>>> Miao
>> IIRC, in that window, the fs_info->mount_opt's SPACE_CACHE bit is cleared.
>> So space cache is not recorded.
> SPACE_CACHE is used to control cache write out, not in-memory cache. All the
> free space should be recorded in in-memory cache.And when we write out
> the in-memory space cache, we need protect the space cache from changing.
>
> Thanks
> Miao
You're right, the wrong space cache problem is not caused by the 
non-atomic mount option problem.
But the atomic mount option change with per-transaction mount option 
will at least make it disappear
when using nospace_cache mount option.

Thanks,
Qu
>
>> Thanks,
>> Qu
>>>>        |- set SPACE_CACHE bit due to cache_gen
>>>>
>>>>                       |- commit_transaction()
>>>>                           |- write space cache and update cache_gen.
>>>>                               but since some of it is not recorded in space
>>>> cache,
>>>>                               the space cache missing some records.
>>>>        |- clear SPACE_CACHE bit dut to nospace_cache
>>>>
>>>> So the space cache is wrong.
>>>>
>>>> Thanks,
>>>> Qu
>>>>>> +    }
>>>>>>         kfree(orig);
>>>>>>         return ret;
>>>>>>     }
>>>>>>
>>>> .
>>>>
>>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount option in a atomic way
  2015-01-30  2:51             ` Qu Wenruo
@ 2015-01-30  3:21               ` Miao Xie
  2015-01-30  3:25                 ` Qu Wenruo
  0 siblings, 1 reply; 40+ messages in thread
From: Miao Xie @ 2015-01-30  3:21 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On Fri, 30 Jan 2015 10:51:52 +0800, Qu Wenruo wrote:
> 
> -------- Original Message --------
> Subject: Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount
> option in a atomic way
> From: Miao Xie <miaoxie@huawei.com>
> To: Qu Wenruo <quwenruo@cn.fujitsu.com>, <linux-btrfs@vger.kernel.org>
> Date: 2015年01月30日 10:06
>> On Fri, 30 Jan 2015 09:33:17 +0800, Qu Wenruo wrote:
>>> -------- Original Message --------
>>> Subject: Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount
>>> option in a atomic way
>>> From: Miao Xie <miaoxie@huawei.com>
>>> To: Qu Wenruo <quwenruo@cn.fujitsu.com>, <linux-btrfs@vger.kernel.org>
>>> Date: 2015年01月30日 09:29
>>>> On Fri, 30 Jan 2015 09:20:46 +0800, Qu Wenruo wrote:
>>>>>> Here need ACCESS_ONCE to wrap info->mount_opt, or the complier might use
>>>>>> info->mount_opt instead of new_opt.
>>>>> Thanks for pointing out this one.
>>>>>> But I worried that this is not key reason of the wrong space cache. Could
>>>>>> you explain the race condition which caused the wrong space cache?
>>>>>>
>>>>>> Thanks
>>>>>> Miao
>>>>> CPU0:
>>>>> remount()
>>>>> |- sync_fs() <- after sync_fs() we can start new trans
>>>>> |- btrfs_parse_options()     CPU1:
>>>>>                       |- start_transaction()
>>>>>                       |- Do some bg allocation, not recorded in space_cache.
>>>> I think it is a bug if a free space is not recorded in space cache. Could you
>>>> explain why it is not recorded?
>>>>
>>>> Thanks
>>>> Miao
>>> IIRC, in that window, the fs_info->mount_opt's SPACE_CACHE bit is cleared.
>>> So space cache is not recorded.
>> SPACE_CACHE is used to control cache write out, not in-memory cache. All the
>> free space should be recorded in in-memory cache.And when we write out
>> the in-memory space cache, we need protect the space cache from changing.
>>
>> Thanks
>> Miao
> You're right, the wrong space cache problem is not caused by the non-atomic
> mount option problem.
> But the atomic mount option change with per-transaction mount option will at
> least make it disappear
> when using nospace_cache mount option.

But we need fix a problem, not hide a problem.

Thanks
Miao

> 
> Thanks,
> Qu
>>
>>> Thanks,
>>> Qu
>>>>>        |- set SPACE_CACHE bit due to cache_gen
>>>>>
>>>>>                       |- commit_transaction()
>>>>>                           |- write space cache and update cache_gen.
>>>>>                               but since some of it is not recorded in space
>>>>> cache,
>>>>>                               the space cache missing some records.
>>>>>        |- clear SPACE_CACHE bit dut to nospace_cache
>>>>>
>>>>> So the space cache is wrong.
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>>> +    }
>>>>>>>         kfree(orig);
>>>>>>>         return ret;
>>>>>>>     }
>>>>>>>
>>>>> .
>>>>>
>>>
> 
> .
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
  2015-01-30  2:02         ` Qu Wenruo
@ 2015-01-30  3:22           ` Miao Xie
  -1 siblings, 0 replies; 40+ messages in thread
From: Miao Xie @ 2015-01-30  3:22 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: linux-fsdevel

On Fri, 30 Jan 2015 10:02:26 +0800, Qu Wenruo wrote:
> 
> -------- Original Message --------
> Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get
> vfsmount from a given sb.
> From: Qu Wenruo <quwenruo@cn.fujitsu.com>
> To: Miao Xie <miaoxie@huawei.com>, linux-btrfs@vger.kernel.org
> Date: 2015年01月30日 09:44
>>
>> -------- Original Message --------
>> Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get
>> vfsmount from a given sb.
>> From: Miao Xie <miaoxie@huawei.com>
>> To: Qu Wenruo <quwenruo@cn.fujitsu.com>, <linux-btrfs@vger.kernel.org>
>> Date: 2015年01月30日 08:52
>>> On Thu, 29 Jan 2015 10:24:39 +0800, Qu Wenruo wrote:
>>>> There are sysfs interfaces in some fs, only btrfs yet, which will modify
>>>> on-disk data.
>>>> Unlike normal file operation routine we can use mnt_want_write_file() to
>>>> protect the operation, change through sysfs won't to be binded to any file
>>>> in the filesystem.
>>>> So we can only extract the first vfsmount of a superblock and pass it to
>>>> mnt_want_write() to do the protection.
>>> This method is wrong, becasue one fs  may be mounted on the multi places
>>> at the same time, someone is R/O, someone is R/W, you may get a R/O and
>>> fail to get the write permission.
>> This shouldn't happen. If someone is ro, the whole fs should be ro, right?
>> You can mount a device which is already mounted as rw to other point as ro,
>> and remount a mount point to ro will also cause all other mount point to ro.
>>
>> So I didn't see the problem here.
>>>
>>> I think you do label/feature change by sysfs interface by the following way
>>>
>>> btrfs_sysfs_change_XXXX()
>>> {
>>>     /* Use trylock to avoid the race with umount */
>>>     if(!mutex_trylock(&sb->s_umount))
>>>         return -EBUSY;
>>>
>>>     check R/O and FREEZE
>>>
>>>     mutex_unlock(&sb->s_umount);
>>> }
>> This looks better since it not introduce changes to VFS.
>>
>> Thanks,
>> Qu
> Oh, wait a second, this one leads to the old problem and old solution.
> 
> If we hold s_umount mutex, we must do freeze check and can't start transaction
> since it will deadlock.
> 
> And for freeze check, we must use sb_try_start_intwrite() to hold the freeze
> lock and then add a new
> btrfs_start_transaction_freeze() which will not call sb_start_write()...
> 
> Oh this seems so similar, v2 or v3 version RFC patch?
> So still goes to the old method?

No. Just check R/O and RREEZE, if failed, go out. if the check pass,
we start_transaction. Because we do it in s_umount lock, no one can
change fs to R/O or FREEZE.

Maybe the above description is not so clear, explain it again.

btrfs_sysfs_change_XXXX()
{
	/* Use trylock to avoid the race with umount */
	if(!mutex_trylock(&sb->s_umount))
		return -EBUSY;

	if (fs is R/O or FREEZED) {
		mutex_unlock(&sb->s_umount);
		return -EACCES;
	}

	btrfs_start_transaction()
	change label/feature
	btrfs_commit_transaction()

	mutex_unlock(&sb->s_umount);
}

Thanks
Miao

> 
> Thanks,
> Qu
>>>
>>> Thanks
>>> Miao
>>>
>>>> Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>
>>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>>> ---
>>>>   fs/namespace.c        | 25 +++++++++++++++++++++++++
>>>>   include/linux/mount.h |  1 +
>>>>   2 files changed, 26 insertions(+)
>>>>
>>>> diff --git a/fs/namespace.c b/fs/namespace.c
>>>> index cd1e968..5a16a62 100644
>>>> --- a/fs/namespace.c
>>>> +++ b/fs/namespace.c
>>>> @@ -1105,6 +1105,31 @@ struct vfsmount *mntget(struct vfsmount *mnt)
>>>>   }
>>>>   EXPORT_SYMBOL(mntget);
>>>>   +/*
>>>> + * get a vfsmount from a given sb
>>>> + *
>>>> + * This is especially used for case where change fs' sysfs interface
>>>> + * will lead to a write, e.g. Change label through sysfs in btrfs.
>>>> + * So vfs can get a vfsmount and then use mnt_want_write() to protect.
>>>> + */
>>>> +struct vfsmount *get_vfsmount_sb(struct super_block *sb)
>>>> +{
>>>> +    struct vfsmount *ret_vfs = NULL;
>>>> +    struct mount *mnt;
>>>> +    int ret = 0;
>>>> +
>>>> +    lock_mount_hash();
>>>> +    if (list_empty(&sb->s_mounts))
>>>> +        goto out;
>>>> +    mnt = list_entry(sb->s_mounts.next, struct mount, mnt_instance);
>>>> +    ret_vfs = &mnt->mnt;
>>>> +    ret_vfs = mntget(ret_vfs);
>>>> +out:
>>>> +    unlock_mount_hash();
>>>> +    return ret_vfs;
>>>> +}
>>>> +EXPORT_SYMBOL(get_vfsmount_sb);
>>>> +
>>>>   struct vfsmount *mnt_clone_internal(struct path *path)
>>>>   {
>>>>       struct mount *p;
>>>> diff --git a/include/linux/mount.h b/include/linux/mount.h
>>>> index c2c561d..cf1b0f5 100644
>>>> --- a/include/linux/mount.h
>>>> +++ b/include/linux/mount.h
>>>> @@ -79,6 +79,7 @@ extern void mnt_drop_write_file(struct file *file);
>>>>   extern void mntput(struct vfsmount *mnt);
>>>>   extern struct vfsmount *mntget(struct vfsmount *mnt);
>>>>   extern struct vfsmount *mnt_clone_internal(struct path *path);
>>>> +extern struct vfsmount *get_vfsmount_sb(struct super_block *sb);
>>>>   extern int __mnt_is_readonly(struct vfsmount *mnt);
>>>>     struct path;
>>>>
>>
> 
> .
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
@ 2015-01-30  3:22           ` Miao Xie
  0 siblings, 0 replies; 40+ messages in thread
From: Miao Xie @ 2015-01-30  3:22 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: linux-fsdevel

On Fri, 30 Jan 2015 10:02:26 +0800, Qu Wenruo wrote:
> 
> -------- Original Message --------
> Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get
> vfsmount from a given sb.
> From: Qu Wenruo <quwenruo@cn.fujitsu.com>
> To: Miao Xie <miaoxie@huawei.com>, linux-btrfs@vger.kernel.org
> Date: 2015年01月30日 09:44
>>
>> -------- Original Message --------
>> Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get
>> vfsmount from a given sb.
>> From: Miao Xie <miaoxie@huawei.com>
>> To: Qu Wenruo <quwenruo@cn.fujitsu.com>, <linux-btrfs@vger.kernel.org>
>> Date: 2015年01月30日 08:52
>>> On Thu, 29 Jan 2015 10:24:39 +0800, Qu Wenruo wrote:
>>>> There are sysfs interfaces in some fs, only btrfs yet, which will modify
>>>> on-disk data.
>>>> Unlike normal file operation routine we can use mnt_want_write_file() to
>>>> protect the operation, change through sysfs won't to be binded to any file
>>>> in the filesystem.
>>>> So we can only extract the first vfsmount of a superblock and pass it to
>>>> mnt_want_write() to do the protection.
>>> This method is wrong, becasue one fs  may be mounted on the multi places
>>> at the same time, someone is R/O, someone is R/W, you may get a R/O and
>>> fail to get the write permission.
>> This shouldn't happen. If someone is ro, the whole fs should be ro, right?
>> You can mount a device which is already mounted as rw to other point as ro,
>> and remount a mount point to ro will also cause all other mount point to ro.
>>
>> So I didn't see the problem here.
>>>
>>> I think you do label/feature change by sysfs interface by the following way
>>>
>>> btrfs_sysfs_change_XXXX()
>>> {
>>>     /* Use trylock to avoid the race with umount */
>>>     if(!mutex_trylock(&sb->s_umount))
>>>         return -EBUSY;
>>>
>>>     check R/O and FREEZE
>>>
>>>     mutex_unlock(&sb->s_umount);
>>> }
>> This looks better since it not introduce changes to VFS.
>>
>> Thanks,
>> Qu
> Oh, wait a second, this one leads to the old problem and old solution.
> 
> If we hold s_umount mutex, we must do freeze check and can't start transaction
> since it will deadlock.
> 
> And for freeze check, we must use sb_try_start_intwrite() to hold the freeze
> lock and then add a new
> btrfs_start_transaction_freeze() which will not call sb_start_write()...
> 
> Oh this seems so similar, v2 or v3 version RFC patch?
> So still goes to the old method?

No. Just check R/O and RREEZE, if failed, go out. if the check pass,
we start_transaction. Because we do it in s_umount lock, no one can
change fs to R/O or FREEZE.

Maybe the above description is not so clear, explain it again.

btrfs_sysfs_change_XXXX()
{
	/* Use trylock to avoid the race with umount */
	if(!mutex_trylock(&sb->s_umount))
		return -EBUSY;

	if (fs is R/O or FREEZED) {
		mutex_unlock(&sb->s_umount);
		return -EACCES;
	}

	btrfs_start_transaction()
	change label/feature
	btrfs_commit_transaction()

	mutex_unlock(&sb->s_umount);
}

Thanks
Miao

> 
> Thanks,
> Qu
>>>
>>> Thanks
>>> Miao
>>>
>>>> Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>
>>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>>> ---
>>>>   fs/namespace.c        | 25 +++++++++++++++++++++++++
>>>>   include/linux/mount.h |  1 +
>>>>   2 files changed, 26 insertions(+)
>>>>
>>>> diff --git a/fs/namespace.c b/fs/namespace.c
>>>> index cd1e968..5a16a62 100644
>>>> --- a/fs/namespace.c
>>>> +++ b/fs/namespace.c
>>>> @@ -1105,6 +1105,31 @@ struct vfsmount *mntget(struct vfsmount *mnt)
>>>>   }
>>>>   EXPORT_SYMBOL(mntget);
>>>>   +/*
>>>> + * get a vfsmount from a given sb
>>>> + *
>>>> + * This is especially used for case where change fs' sysfs interface
>>>> + * will lead to a write, e.g. Change label through sysfs in btrfs.
>>>> + * So vfs can get a vfsmount and then use mnt_want_write() to protect.
>>>> + */
>>>> +struct vfsmount *get_vfsmount_sb(struct super_block *sb)
>>>> +{
>>>> +    struct vfsmount *ret_vfs = NULL;
>>>> +    struct mount *mnt;
>>>> +    int ret = 0;
>>>> +
>>>> +    lock_mount_hash();
>>>> +    if (list_empty(&sb->s_mounts))
>>>> +        goto out;
>>>> +    mnt = list_entry(sb->s_mounts.next, struct mount, mnt_instance);
>>>> +    ret_vfs = &mnt->mnt;
>>>> +    ret_vfs = mntget(ret_vfs);
>>>> +out:
>>>> +    unlock_mount_hash();
>>>> +    return ret_vfs;
>>>> +}
>>>> +EXPORT_SYMBOL(get_vfsmount_sb);
>>>> +
>>>>   struct vfsmount *mnt_clone_internal(struct path *path)
>>>>   {
>>>>       struct mount *p;
>>>> diff --git a/include/linux/mount.h b/include/linux/mount.h
>>>> index c2c561d..cf1b0f5 100644
>>>> --- a/include/linux/mount.h
>>>> +++ b/include/linux/mount.h
>>>> @@ -79,6 +79,7 @@ extern void mnt_drop_write_file(struct file *file);
>>>>   extern void mntput(struct vfsmount *mnt);
>>>>   extern struct vfsmount *mntget(struct vfsmount *mnt);
>>>>   extern struct vfsmount *mnt_clone_internal(struct path *path);
>>>> +extern struct vfsmount *get_vfsmount_sb(struct super_block *sb);
>>>>   extern int __mnt_is_readonly(struct vfsmount *mnt);
>>>>     struct path;
>>>>
>>
> 
> .
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount option in a atomic way
  2015-01-30  3:21               ` Miao Xie
@ 2015-01-30  3:25                 ` Qu Wenruo
  0 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2015-01-30  3:25 UTC (permalink / raw)
  To: Miao Xie, linux-btrfs


-------- Original Message --------
Subject: Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() 
parse mount option in a atomic way
From: Miao Xie <miaoxie@huawei.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>, <linux-btrfs@vger.kernel.org>
Date: 2015年01月30日 11:21
> On Fri, 30 Jan 2015 10:51:52 +0800, Qu Wenruo wrote:
>> -------- Original Message --------
>> Subject: Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount
>> option in a atomic way
>> From: Miao Xie <miaoxie@huawei.com>
>> To: Qu Wenruo <quwenruo@cn.fujitsu.com>, <linux-btrfs@vger.kernel.org>
>> Date: 2015年01月30日 10:06
>>> On Fri, 30 Jan 2015 09:33:17 +0800, Qu Wenruo wrote:
>>>> -------- Original Message --------
>>>> Subject: Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount
>>>> option in a atomic way
>>>> From: Miao Xie <miaoxie@huawei.com>
>>>> To: Qu Wenruo <quwenruo@cn.fujitsu.com>, <linux-btrfs@vger.kernel.org>
>>>> Date: 2015年01月30日 09:29
>>>>> On Fri, 30 Jan 2015 09:20:46 +0800, Qu Wenruo wrote:
>>>>>>> Here need ACCESS_ONCE to wrap info->mount_opt, or the complier might use
>>>>>>> info->mount_opt instead of new_opt.
>>>>>> Thanks for pointing out this one.
>>>>>>> But I worried that this is not key reason of the wrong space cache. Could
>>>>>>> you explain the race condition which caused the wrong space cache?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Miao
>>>>>> CPU0:
>>>>>> remount()
>>>>>> |- sync_fs() <- after sync_fs() we can start new trans
>>>>>> |- btrfs_parse_options()     CPU1:
>>>>>>                        |- start_transaction()
>>>>>>                        |- Do some bg allocation, not recorded in space_cache.
>>>>> I think it is a bug if a free space is not recorded in space cache. Could you
>>>>> explain why it is not recorded?
>>>>>
>>>>> Thanks
>>>>> Miao
>>>> IIRC, in that window, the fs_info->mount_opt's SPACE_CACHE bit is cleared.
>>>> So space cache is not recorded.
>>> SPACE_CACHE is used to control cache write out, not in-memory cache. All the
>>> free space should be recorded in in-memory cache.And when we write out
>>> the in-memory space cache, we need protect the space cache from changing.
>>>
>>> Thanks
>>> Miao
>> You're right, the wrong space cache problem is not caused by the non-atomic
>> mount option problem.
>> But the atomic mount option change with per-transaction mount option will at
>> least make it disappear
>> when using nospace_cache mount option.
> But we need fix a problem, not hide a problem.
>
> Thanks
> Miao
Yes, But I don't mean to hide it.
If it's a bug in space cache, it will still be reproducible on with 
space_cache mount option.

The patch itself is focused on the space cache created with 
nospace_cache mount option,
at least it's doing it job.

The wrong space cahce problem is another story then.
I'll remove the wrong space cache description in the commit message in 
next version.

Thanks,
Qu
>
>> Thanks,
>> Qu
>>>> Thanks,
>>>> Qu
>>>>>>         |- set SPACE_CACHE bit due to cache_gen
>>>>>>
>>>>>>                        |- commit_transaction()
>>>>>>                            |- write space cache and update cache_gen.
>>>>>>                                but since some of it is not recorded in space
>>>>>> cache,
>>>>>>                                the space cache missing some records.
>>>>>>         |- clear SPACE_CACHE bit dut to nospace_cache
>>>>>>
>>>>>> So the space cache is wrong.
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>>>>> +    }
>>>>>>>>          kfree(orig);
>>>>>>>>          return ret;
>>>>>>>>      }
>>>>>>>>
>>>>>> .
>>>>>>
>> .
>>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
  2015-01-30  3:22           ` Miao Xie
@ 2015-01-30  3:30             ` Qu Wenruo
  -1 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2015-01-30  3:30 UTC (permalink / raw)
  To: Miao Xie, linux-btrfs; +Cc: linux-fsdevel


-------- Original Message --------
Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function 
to get vfsmount from a given sb.
From: Miao Xie <miaoxie@huawei.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>, <linux-btrfs@vger.kernel.org>
Date: 2015年01月30日 11:22
> On Fri, 30 Jan 2015 10:02:26 +0800, Qu Wenruo wrote:
>> -------- Original Message --------
>> Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get
>> vfsmount from a given sb.
>> From: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> To: Miao Xie <miaoxie@huawei.com>, linux-btrfs@vger.kernel.org
>> Date: 2015年01月30日 09:44
>>> -------- Original Message --------
>>> Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get
>>> vfsmount from a given sb.
>>> From: Miao Xie <miaoxie@huawei.com>
>>> To: Qu Wenruo <quwenruo@cn.fujitsu.com>, <linux-btrfs@vger.kernel.org>
>>> Date: 2015年01月30日 08:52
>>>> On Thu, 29 Jan 2015 10:24:39 +0800, Qu Wenruo wrote:
>>>>> There are sysfs interfaces in some fs, only btrfs yet, which will modify
>>>>> on-disk data.
>>>>> Unlike normal file operation routine we can use mnt_want_write_file() to
>>>>> protect the operation, change through sysfs won't to be binded to any file
>>>>> in the filesystem.
>>>>> So we can only extract the first vfsmount of a superblock and pass it to
>>>>> mnt_want_write() to do the protection.
>>>> This method is wrong, becasue one fs  may be mounted on the multi places
>>>> at the same time, someone is R/O, someone is R/W, you may get a R/O and
>>>> fail to get the write permission.
>>> This shouldn't happen. If someone is ro, the whole fs should be ro, right?
>>> You can mount a device which is already mounted as rw to other point as ro,
>>> and remount a mount point to ro will also cause all other mount point to ro.
>>>
>>> So I didn't see the problem here.
>>>> I think you do label/feature change by sysfs interface by the following way
>>>>
>>>> btrfs_sysfs_change_XXXX()
>>>> {
>>>>      /* Use trylock to avoid the race with umount */
>>>>      if(!mutex_trylock(&sb->s_umount))
>>>>          return -EBUSY;
>>>>
>>>>      check R/O and FREEZE
>>>>
>>>>      mutex_unlock(&sb->s_umount);
>>>> }
>>> This looks better since it not introduce changes to VFS.
>>>
>>> Thanks,
>>> Qu
>> Oh, wait a second, this one leads to the old problem and old solution.
>>
>> If we hold s_umount mutex, we must do freeze check and can't start transaction
>> since it will deadlock.
>>
>> And for freeze check, we must use sb_try_start_intwrite() to hold the freeze
>> lock and then add a new
>> btrfs_start_transaction_freeze() which will not call sb_start_write()...
>>
>> Oh this seems so similar, v2 or v3 version RFC patch?
>> So still goes to the old method?
> No. Just check R/O and RREEZE, if failed, go out. if the check pass,
> we start_transaction. Because we do it in s_umount lock, no one can
> change fs to R/O or FREEZE.
>
> Maybe the above description is not so clear, explain it again.
>
> btrfs_sysfs_change_XXXX()
> {
> 	/* Use trylock to avoid the race with umount */
> 	if(!mutex_trylock(&sb->s_umount))
> 		return -EBUSY;
>
> 	if (fs is R/O or FREEZED) {
> 		mutex_unlock(&sb->s_umount);
> 		return -EACCES;
> 	}
>
> 	btrfs_start_transaction()
> 	change label/feature
> 	btrfs_commit_transaction()
>
> 	mutex_unlock(&sb->s_umount);
> }
I prefer the sb_want_write() method, since it doesn't even need to hold 
the s_umount mutex.

Thanks,
Qu
> Thanks
> Miao
>
>> Thanks,
>> Qu
>>>> Thanks
>>>> Miao
>>>>
>>>>> Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>
>>>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>>>> ---
>>>>>    fs/namespace.c        | 25 +++++++++++++++++++++++++
>>>>>    include/linux/mount.h |  1 +
>>>>>    2 files changed, 26 insertions(+)
>>>>>
>>>>> diff --git a/fs/namespace.c b/fs/namespace.c
>>>>> index cd1e968..5a16a62 100644
>>>>> --- a/fs/namespace.c
>>>>> +++ b/fs/namespace.c
>>>>> @@ -1105,6 +1105,31 @@ struct vfsmount *mntget(struct vfsmount *mnt)
>>>>>    }
>>>>>    EXPORT_SYMBOL(mntget);
>>>>>    +/*
>>>>> + * get a vfsmount from a given sb
>>>>> + *
>>>>> + * This is especially used for case where change fs' sysfs interface
>>>>> + * will lead to a write, e.g. Change label through sysfs in btrfs.
>>>>> + * So vfs can get a vfsmount and then use mnt_want_write() to protect.
>>>>> + */
>>>>> +struct vfsmount *get_vfsmount_sb(struct super_block *sb)
>>>>> +{
>>>>> +    struct vfsmount *ret_vfs = NULL;
>>>>> +    struct mount *mnt;
>>>>> +    int ret = 0;
>>>>> +
>>>>> +    lock_mount_hash();
>>>>> +    if (list_empty(&sb->s_mounts))
>>>>> +        goto out;
>>>>> +    mnt = list_entry(sb->s_mounts.next, struct mount, mnt_instance);
>>>>> +    ret_vfs = &mnt->mnt;
>>>>> +    ret_vfs = mntget(ret_vfs);
>>>>> +out:
>>>>> +    unlock_mount_hash();
>>>>> +    return ret_vfs;
>>>>> +}
>>>>> +EXPORT_SYMBOL(get_vfsmount_sb);
>>>>> +
>>>>>    struct vfsmount *mnt_clone_internal(struct path *path)
>>>>>    {
>>>>>        struct mount *p;
>>>>> diff --git a/include/linux/mount.h b/include/linux/mount.h
>>>>> index c2c561d..cf1b0f5 100644
>>>>> --- a/include/linux/mount.h
>>>>> +++ b/include/linux/mount.h
>>>>> @@ -79,6 +79,7 @@ extern void mnt_drop_write_file(struct file *file);
>>>>>    extern void mntput(struct vfsmount *mnt);
>>>>>    extern struct vfsmount *mntget(struct vfsmount *mnt);
>>>>>    extern struct vfsmount *mnt_clone_internal(struct path *path);
>>>>> +extern struct vfsmount *get_vfsmount_sb(struct super_block *sb);
>>>>>    extern int __mnt_is_readonly(struct vfsmount *mnt);
>>>>>      struct path;
>>>>>
>> .
>>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
@ 2015-01-30  3:30             ` Qu Wenruo
  0 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2015-01-30  3:30 UTC (permalink / raw)
  To: Miao Xie, linux-btrfs; +Cc: linux-fsdevel


-------- Original Message --------
Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function 
to get vfsmount from a given sb.
From: Miao Xie <miaoxie@huawei.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>, <linux-btrfs@vger.kernel.org>
Date: 2015年01月30日 11:22
> On Fri, 30 Jan 2015 10:02:26 +0800, Qu Wenruo wrote:
>> -------- Original Message --------
>> Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get
>> vfsmount from a given sb.
>> From: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> To: Miao Xie <miaoxie@huawei.com>, linux-btrfs@vger.kernel.org
>> Date: 2015年01月30日 09:44
>>> -------- Original Message --------
>>> Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get
>>> vfsmount from a given sb.
>>> From: Miao Xie <miaoxie@huawei.com>
>>> To: Qu Wenruo <quwenruo@cn.fujitsu.com>, <linux-btrfs@vger.kernel.org>
>>> Date: 2015年01月30日 08:52
>>>> On Thu, 29 Jan 2015 10:24:39 +0800, Qu Wenruo wrote:
>>>>> There are sysfs interfaces in some fs, only btrfs yet, which will modify
>>>>> on-disk data.
>>>>> Unlike normal file operation routine we can use mnt_want_write_file() to
>>>>> protect the operation, change through sysfs won't to be binded to any file
>>>>> in the filesystem.
>>>>> So we can only extract the first vfsmount of a superblock and pass it to
>>>>> mnt_want_write() to do the protection.
>>>> This method is wrong, becasue one fs  may be mounted on the multi places
>>>> at the same time, someone is R/O, someone is R/W, you may get a R/O and
>>>> fail to get the write permission.
>>> This shouldn't happen. If someone is ro, the whole fs should be ro, right?
>>> You can mount a device which is already mounted as rw to other point as ro,
>>> and remount a mount point to ro will also cause all other mount point to ro.
>>>
>>> So I didn't see the problem here.
>>>> I think you do label/feature change by sysfs interface by the following way
>>>>
>>>> btrfs_sysfs_change_XXXX()
>>>> {
>>>>      /* Use trylock to avoid the race with umount */
>>>>      if(!mutex_trylock(&sb->s_umount))
>>>>          return -EBUSY;
>>>>
>>>>      check R/O and FREEZE
>>>>
>>>>      mutex_unlock(&sb->s_umount);
>>>> }
>>> This looks better since it not introduce changes to VFS.
>>>
>>> Thanks,
>>> Qu
>> Oh, wait a second, this one leads to the old problem and old solution.
>>
>> If we hold s_umount mutex, we must do freeze check and can't start transaction
>> since it will deadlock.
>>
>> And for freeze check, we must use sb_try_start_intwrite() to hold the freeze
>> lock and then add a new
>> btrfs_start_transaction_freeze() which will not call sb_start_write()...
>>
>> Oh this seems so similar, v2 or v3 version RFC patch?
>> So still goes to the old method?
> No. Just check R/O and RREEZE, if failed, go out. if the check pass,
> we start_transaction. Because we do it in s_umount lock, no one can
> change fs to R/O or FREEZE.
>
> Maybe the above description is not so clear, explain it again.
>
> btrfs_sysfs_change_XXXX()
> {
> 	/* Use trylock to avoid the race with umount */
> 	if(!mutex_trylock(&sb->s_umount))
> 		return -EBUSY;
>
> 	if (fs is R/O or FREEZED) {
> 		mutex_unlock(&sb->s_umount);
> 		return -EACCES;
> 	}
>
> 	btrfs_start_transaction()
> 	change label/feature
> 	btrfs_commit_transaction()
>
> 	mutex_unlock(&sb->s_umount);
> }
I prefer the sb_want_write() method, since it doesn't even need to hold 
the s_umount mutex.

Thanks,
Qu
> Thanks
> Miao
>
>> Thanks,
>> Qu
>>>> Thanks
>>>> Miao
>>>>
>>>>> Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>
>>>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>>>> ---
>>>>>    fs/namespace.c        | 25 +++++++++++++++++++++++++
>>>>>    include/linux/mount.h |  1 +
>>>>>    2 files changed, 26 insertions(+)
>>>>>
>>>>> diff --git a/fs/namespace.c b/fs/namespace.c
>>>>> index cd1e968..5a16a62 100644
>>>>> --- a/fs/namespace.c
>>>>> +++ b/fs/namespace.c
>>>>> @@ -1105,6 +1105,31 @@ struct vfsmount *mntget(struct vfsmount *mnt)
>>>>>    }
>>>>>    EXPORT_SYMBOL(mntget);
>>>>>    +/*
>>>>> + * get a vfsmount from a given sb
>>>>> + *
>>>>> + * This is especially used for case where change fs' sysfs interface
>>>>> + * will lead to a write, e.g. Change label through sysfs in btrfs.
>>>>> + * So vfs can get a vfsmount and then use mnt_want_write() to protect.
>>>>> + */
>>>>> +struct vfsmount *get_vfsmount_sb(struct super_block *sb)
>>>>> +{
>>>>> +    struct vfsmount *ret_vfs = NULL;
>>>>> +    struct mount *mnt;
>>>>> +    int ret = 0;
>>>>> +
>>>>> +    lock_mount_hash();
>>>>> +    if (list_empty(&sb->s_mounts))
>>>>> +        goto out;
>>>>> +    mnt = list_entry(sb->s_mounts.next, struct mount, mnt_instance);
>>>>> +    ret_vfs = &mnt->mnt;
>>>>> +    ret_vfs = mntget(ret_vfs);
>>>>> +out:
>>>>> +    unlock_mount_hash();
>>>>> +    return ret_vfs;
>>>>> +}
>>>>> +EXPORT_SYMBOL(get_vfsmount_sb);
>>>>> +
>>>>>    struct vfsmount *mnt_clone_internal(struct path *path)
>>>>>    {
>>>>>        struct mount *p;
>>>>> diff --git a/include/linux/mount.h b/include/linux/mount.h
>>>>> index c2c561d..cf1b0f5 100644
>>>>> --- a/include/linux/mount.h
>>>>> +++ b/include/linux/mount.h
>>>>> @@ -79,6 +79,7 @@ extern void mnt_drop_write_file(struct file *file);
>>>>>    extern void mntput(struct vfsmount *mnt);
>>>>>    extern struct vfsmount *mntget(struct vfsmount *mnt);
>>>>>    extern struct vfsmount *mnt_clone_internal(struct path *path);
>>>>> +extern struct vfsmount *get_vfsmount_sb(struct super_block *sb);
>>>>>    extern int __mnt_is_readonly(struct vfsmount *mnt);
>>>>>      struct path;
>>>>>
>> .
>>

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
  2015-01-30  2:14       ` Al Viro
@ 2015-01-30  4:14         ` Miao Xie
  2015-01-30  4:37           ` Al Viro
  2015-01-30  5:30             ` Qu Wenruo
  0 siblings, 2 replies; 40+ messages in thread
From: Miao Xie @ 2015-01-30  4:14 UTC (permalink / raw)
  To: Al Viro, Qu Wenruo; +Cc: linux-btrfs, linux-fsdevel

On Fri, 30 Jan 2015 02:14:45 +0000, Al Viro wrote:
> On Fri, Jan 30, 2015 at 09:44:03AM +0800, Qu Wenruo wrote:
> 
>> This shouldn't happen. If someone is ro, the whole fs should be ro, right?
> 
> Wrong.  Individual vfsmounts over an r/w superblock might very well be r/o.
> As for that trylock...  What for?  It invites transient failures for no
> good reason.  Removal of sysfs entry will block while write(2) to that sucker
> is in progress, so btrfs shutdown will block at that point in ctree_close().
> It won't go away under you.

could you explain the race condition? I think the deadlock won't happen, during
the btrfs shutdown, we hold s_umount, the write operation will fail to lock it,
and quit quickly, and then umount will continue.

I think sb_want_write() is similar to trylock(s_umount), the difference is that
sb_want_write() is more complex.

> 
> Now, you might want to move those sysfs entry removals to the very beginning
> of btrfs_kill_super(), but that's a different story - you need only to make
> sure that they are removed not later than the destruction of the data
> structures they need (IOW, the current location might very well be OK - I
> hadn't checked the details).

Yes, we need move those sysfs entry removals, but needn't move to the very
beginning of btrfs_kill_super(), just at the beginning of close_ctree();

The current location is not right, it will introduce the use-after-free
problem. because we remove the sysfs entry after we release
transaction_kthread, use-after-free problem might happen in this case
	Task1				Task2
	change Label by sysfs
					close_ctree
					  kthread_stop(transaction_kthread);
	  change label
	  wake_up(transaction_kthread)


Thanks
Miao

> 
> As for "it won't go r/o under us" - sb_want_write() will do that just fine.
> .
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
  2015-01-30  4:14         ` Miao Xie
@ 2015-01-30  4:37           ` Al Viro
  2015-01-30  5:34             ` Miao Xie
  2015-01-30  5:30             ` Qu Wenruo
  1 sibling, 1 reply; 40+ messages in thread
From: Al Viro @ 2015-01-30  4:37 UTC (permalink / raw)
  To: Miao Xie; +Cc: Qu Wenruo, linux-btrfs, linux-fsdevel

On Fri, Jan 30, 2015 at 12:14:24PM +0800, Miao Xie wrote:
> On Fri, 30 Jan 2015 02:14:45 +0000, Al Viro wrote:
> > On Fri, Jan 30, 2015 at 09:44:03AM +0800, Qu Wenruo wrote:
> > 
> >> This shouldn't happen. If someone is ro, the whole fs should be ro, right?
> > 
> > Wrong.  Individual vfsmounts over an r/w superblock might very well be r/o.
> > As for that trylock...  What for?  It invites transient failures for no
> > good reason.  Removal of sysfs entry will block while write(2) to that sucker
> > is in progress, so btrfs shutdown will block at that point in ctree_close().
> > It won't go away under you.
> 
> could you explain the race condition? I think the deadlock won't happen, during
> the btrfs shutdown, we hold s_umount, the write operation will fail to lock it,
> and quit quickly, and then umount will continue.

	First of all, ->s_umount is not a mutex; it's rwsem.  So you mean
down_read_trylock().  As for the transient failures - grep for down_write
on it...  E.g. have somebody call mount() from the same device.  We call
sget(), which finds existing superblock and calls grab_super().  Sure,
that ->s_umount will be released shortly, but in the meanwhile your trylock
will fail...

> I think sb_want_write() is similar to trylock(s_umount), the difference is that
> sb_want_write() is more complex.
> 
> > 
> > Now, you might want to move those sysfs entry removals to the very beginning
> > of btrfs_kill_super(), but that's a different story - you need only to make
> > sure that they are removed not later than the destruction of the data
> > structures they need (IOW, the current location might very well be OK - I
> > hadn't checked the details).
> 
> Yes, we need move those sysfs entry removals, but needn't move to the very
> beginning of btrfs_kill_super(), just at the beginning of close_ctree();

So move them...  It's a matter of moving one function call around a bit.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
  2015-01-30  4:14         ` Miao Xie
@ 2015-01-30  5:30             ` Qu Wenruo
  2015-01-30  5:30             ` Qu Wenruo
  1 sibling, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2015-01-30  5:30 UTC (permalink / raw)
  To: Miao Xie, Al Viro; +Cc: linux-btrfs, linux-fsdevel


-------- Original Message --------
Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function 
to get vfsmount from a given sb.
From: Miao Xie <miaoxie@huawei.com>
To: Al Viro <viro@ZenIV.linux.org.uk>, Qu Wenruo <quwenruo@cn.fujitsu.com>
Date: 2015年01月30日 12:14
> On Fri, 30 Jan 2015 02:14:45 +0000, Al Viro wrote:
>> On Fri, Jan 30, 2015 at 09:44:03AM +0800, Qu Wenruo wrote:
>>
>>> This shouldn't happen. If someone is ro, the whole fs should be ro, right?
>> Wrong.  Individual vfsmounts over an r/w superblock might very well be r/o.
>> As for that trylock...  What for?  It invites transient failures for no
>> good reason.  Removal of sysfs entry will block while write(2) to that sucker
>> is in progress, so btrfs shutdown will block at that point in ctree_close().
>> It won't go away under you.
> could you explain the race condition? I think the deadlock won't happen, during
> the btrfs shutdown, we hold s_umount, the write operation will fail to lock it,
> and quit quickly, and then umount will continue.
>
> I think sb_want_write() is similar to trylock(s_umount), the difference is that
> sb_want_write() is more complex.
How?
sb_want_write() should be much like mnt_want_write(), except no need to 
increase vfsmout ref count things
and no need to check per mount ro/rw things.

Thanks,
Qu
>
>> Now, you might want to move those sysfs entry removals to the very beginning
>> of btrfs_kill_super(), but that's a different story - you need only to make
>> sure that they are removed not later than the destruction of the data
>> structures they need (IOW, the current location might very well be OK - I
>> hadn't checked the details).
> Yes, we need move those sysfs entry removals, but needn't move to the very
> beginning of btrfs_kill_super(), just at the beginning of close_ctree();
>
> The current location is not right, it will introduce the use-after-free
> problem. because we remove the sysfs entry after we release
> transaction_kthread, use-after-free problem might happen in this case
> 	Task1				Task2
> 	change Label by sysfs
> 					close_ctree
> 					  kthread_stop(transaction_kthread);
> 	  change label
> 	  wake_up(transaction_kthread)
>
>
> Thanks
> Miao
>
>> As for "it won't go r/o under us" - sb_want_write() will do that just fine.
>> .
>>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
@ 2015-01-30  5:30             ` Qu Wenruo
  0 siblings, 0 replies; 40+ messages in thread
From: Qu Wenruo @ 2015-01-30  5:30 UTC (permalink / raw)
  To: Miao Xie, Al Viro; +Cc: linux-btrfs, linux-fsdevel


-------- Original Message --------
Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function 
to get vfsmount from a given sb.
From: Miao Xie <miaoxie@huawei.com>
To: Al Viro <viro@ZenIV.linux.org.uk>, Qu Wenruo <quwenruo@cn.fujitsu.com>
Date: 2015年01月30日 12:14
> On Fri, 30 Jan 2015 02:14:45 +0000, Al Viro wrote:
>> On Fri, Jan 30, 2015 at 09:44:03AM +0800, Qu Wenruo wrote:
>>
>>> This shouldn't happen. If someone is ro, the whole fs should be ro, right?
>> Wrong.  Individual vfsmounts over an r/w superblock might very well be r/o.
>> As for that trylock...  What for?  It invites transient failures for no
>> good reason.  Removal of sysfs entry will block while write(2) to that sucker
>> is in progress, so btrfs shutdown will block at that point in ctree_close().
>> It won't go away under you.
> could you explain the race condition? I think the deadlock won't happen, during
> the btrfs shutdown, we hold s_umount, the write operation will fail to lock it,
> and quit quickly, and then umount will continue.
>
> I think sb_want_write() is similar to trylock(s_umount), the difference is that
> sb_want_write() is more complex.
How?
sb_want_write() should be much like mnt_want_write(), except no need to 
increase vfsmout ref count things
and no need to check per mount ro/rw things.

Thanks,
Qu
>
>> Now, you might want to move those sysfs entry removals to the very beginning
>> of btrfs_kill_super(), but that's a different story - you need only to make
>> sure that they are removed not later than the destruction of the data
>> structures they need (IOW, the current location might very well be OK - I
>> hadn't checked the details).
> Yes, we need move those sysfs entry removals, but needn't move to the very
> beginning of btrfs_kill_super(), just at the beginning of close_ctree();
>
> The current location is not right, it will introduce the use-after-free
> problem. because we remove the sysfs entry after we release
> transaction_kthread, use-after-free problem might happen in this case
> 	Task1				Task2
> 	change Label by sysfs
> 					close_ctree
> 					  kthread_stop(transaction_kthread);
> 	  change label
> 	  wake_up(transaction_kthread)
>
>
> Thanks
> Miao
>
>> As for "it won't go r/o under us" - sb_want_write() will do that just fine.
>> .
>>

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
  2015-01-30  4:37           ` Al Viro
@ 2015-01-30  5:34             ` Miao Xie
  2015-01-30  6:15               ` Al Viro
  0 siblings, 1 reply; 40+ messages in thread
From: Miao Xie @ 2015-01-30  5:34 UTC (permalink / raw)
  To: Al Viro; +Cc: Qu Wenruo, linux-btrfs, linux-fsdevel

On Fri, 30 Jan 2015 04:37:14 +0000, Al Viro wrote:
> On Fri, Jan 30, 2015 at 12:14:24PM +0800, Miao Xie wrote:
>> On Fri, 30 Jan 2015 02:14:45 +0000, Al Viro wrote:
>>> On Fri, Jan 30, 2015 at 09:44:03AM +0800, Qu Wenruo wrote:
>>>
>>>> This shouldn't happen. If someone is ro, the whole fs should be ro, right?
>>>
>>> Wrong.  Individual vfsmounts over an r/w superblock might very well be r/o.
>>> As for that trylock...  What for?  It invites transient failures for no
>>> good reason.  Removal of sysfs entry will block while write(2) to that sucker
>>> is in progress, so btrfs shutdown will block at that point in ctree_close().
>>> It won't go away under you.
>>
>> could you explain the race condition? I think the deadlock won't happen, during
>> the btrfs shutdown, we hold s_umount, the write operation will fail to lock it,
>> and quit quickly, and then umount will continue.
> 
> 	First of all, ->s_umount is not a mutex; it's rwsem.  So you mean
> down_read_trylock().  As for the transient failures - grep for down_write
> on it...  E.g. have somebody call mount() from the same device.  We call
> sget(), which finds existing superblock and calls grab_super().  Sure,
> that ->s_umount will be released shortly, but in the meanwhile your trylock
> will fail...

I know it, so I suggested to return -EBUSY in the previous mail.
I think it is acceptable method, mount/umount operations are not so many
after all.

Thanks
Miao

> 
>> I think sb_want_write() is similar to trylock(s_umount), the difference is that
>> sb_want_write() is more complex.
>>
>>>
>>> Now, you might want to move those sysfs entry removals to the very beginning
>>> of btrfs_kill_super(), but that's a different story - you need only to make
>>> sure that they are removed not later than the destruction of the data
>>> structures they need (IOW, the current location might very well be OK - I
>>> hadn't checked the details).
>>
>> Yes, we need move those sysfs entry removals, but needn't move to the very
>> beginning of btrfs_kill_super(), just at the beginning of close_ctree();
> 
> So move them...  It's a matter of moving one function call around a bit.
> .
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
  2015-01-30  5:34             ` Miao Xie
@ 2015-01-30  6:15               ` Al Viro
  0 siblings, 0 replies; 40+ messages in thread
From: Al Viro @ 2015-01-30  6:15 UTC (permalink / raw)
  To: Miao Xie; +Cc: Qu Wenruo, linux-btrfs, linux-fsdevel

On Fri, Jan 30, 2015 at 01:34:41PM +0800, Miao Xie wrote:
> > 	First of all, ->s_umount is not a mutex; it's rwsem.  So you mean
> > down_read_trylock().  As for the transient failures - grep for down_write
> > on it...  E.g. have somebody call mount() from the same device.  We call
> > sget(), which finds existing superblock and calls grab_super().  Sure,
> > that ->s_umount will be released shortly, but in the meanwhile your trylock
> > will fail...
> 
> I know it, so I suggested to return -EBUSY in the previous mail.
> I think it is acceptable method, mount/umount operations are not so many
> after all.

Er...  What for, when there's a trivial variant that doesn't suffer those
spurious -EBUSY?  Seriously, just move sysfs removals up to make sure
that any possible fs shutdown won't progress past those during sysfs IO and
use sb_want_write/sb_dont_write to make sure it'll stay r/w for the duration.
What's the problem?

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2015-01-30  6:15 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-29  2:24 [PATCH RESEND v4 0/8] Fix freeze/sysfs deadlock in better method Qu Wenruo
2015-01-29  2:24 ` [PATCH RESEND v4 1/8] Revert "btrfs: add support for processing pending changes" related commits Qu Wenruo
2015-01-29  2:24 ` [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount option in a atomic way Qu Wenruo
2015-01-30  0:31   ` Miao Xie
2015-01-30  1:20     ` Qu Wenruo
2015-01-30  1:29       ` Miao Xie
2015-01-30  1:33         ` Qu Wenruo
2015-01-30  2:06           ` Miao Xie
2015-01-30  2:51             ` Qu Wenruo
2015-01-30  3:21               ` Miao Xie
2015-01-30  3:25                 ` Qu Wenruo
2015-01-29  2:24 ` [PATCH RESEND v4 3/8] btrfs: Introduce per-transaction mount_opt to keep mount option consistent during transaction Qu Wenruo
2015-01-29  2:24 ` [PATCH RESEND v4 4/8] btrfs: Use btrfs_test_trans_opt() to handle SPACE_CACHE if it's under transaction protect Qu Wenruo
2015-01-29  2:24 ` [PATCH RESEND v4 5/8] btrfs: Use btrfs_test_trans_opt() to handle INODE_CACHE " Qu Wenruo
2015-01-29  2:24 ` [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb Qu Wenruo
2015-01-29 12:37   ` David Sterba
2015-01-29 15:23     ` Al Viro
2015-01-30  1:11       ` Qu Wenruo
2015-01-30  1:11         ` Qu Wenruo
2015-01-30  2:09         ` Al Viro
2015-01-30  2:20           ` Qu Wenruo
2015-01-30  2:20             ` Qu Wenruo
2015-01-30  0:52   ` Miao Xie
2015-01-30  1:44     ` Qu Wenruo
2015-01-30  1:44       ` Qu Wenruo
2015-01-30  2:02       ` Qu Wenruo
2015-01-30  2:02         ` Qu Wenruo
2015-01-30  3:22         ` Miao Xie
2015-01-30  3:22           ` Miao Xie
2015-01-30  3:30           ` Qu Wenruo
2015-01-30  3:30             ` Qu Wenruo
2015-01-30  2:14       ` Al Viro
2015-01-30  4:14         ` Miao Xie
2015-01-30  4:37           ` Al Viro
2015-01-30  5:34             ` Miao Xie
2015-01-30  6:15               ` Al Viro
2015-01-30  5:30           ` Qu Wenruo
2015-01-30  5:30             ` Qu Wenruo
2015-01-29  2:24 ` [PATCH RESEND v4 7/8] btrfs: Use mnt_want_write() to protect label change Qu Wenruo
2015-01-29  2:24 ` [PATCH RESEND v4 8/8] btrfs: Use mnt_want_write() to protect sysfs feature change Qu Wenruo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.