linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/6] btrfs: fix some data races during fsync and cleanups
@ 2023-10-04 10:38 fdmanana
  2023-10-04 10:38 ` [PATCH 1/6] btrfs: add and use helpers for reading and writing last_log_commit fdmanana
                   ` (6 more replies)
  0 siblings, 7 replies; 10+ messages in thread
From: fdmanana @ 2023-10-04 10:38 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

The following remove some data races affecting mostly fsync. In general
these are mostly harmless from a functional perspective, though there are
a few cases that could cause some unexpected results but should be very
rare to hit. There's also a couple cleanups here.
More details on the changelogs.

Filipe Manana (6):
  btrfs: add and use helpers for reading and writing last_log_commit
  btrfs: add and use helpers for reading and writing log_transid
  btrfs: add and use helpers for reading and writing fs_info->generation
  btrfs: add and use helpers for reading and writing last_trans_committed
  btrfs: remove pointless barrier from btrfs_sync_file()
  btrfs: update comment for struct btrfs_inode::lock

 fs/btrfs/btrfs_inode.h  | 34 +++++++++++++++++++---------------
 fs/btrfs/ctree.h        | 34 +++++++++++++++++++++++++++++++++-
 fs/btrfs/disk-io.c      | 17 +++++++++--------
 fs/btrfs/file.c         |  5 ++---
 fs/btrfs/fs.h           | 30 ++++++++++++++++++++++++++++++
 fs/btrfs/inode.c        |  4 ++--
 fs/btrfs/ioctl.c        |  4 ++--
 fs/btrfs/scrub.c        |  2 +-
 fs/btrfs/super.c        |  7 ++++---
 fs/btrfs/sysfs.c        |  2 +-
 fs/btrfs/transaction.c  |  8 ++++----
 fs/btrfs/transaction.h  |  2 +-
 fs/btrfs/tree-checker.c |  2 +-
 fs/btrfs/tree-log.c     |  6 +++---
 14 files changed, 112 insertions(+), 45 deletions(-)

-- 
2.40.1


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/6] btrfs: add and use helpers for reading and writing last_log_commit
  2023-10-04 10:38 [PATCH 0/6] btrfs: fix some data races during fsync and cleanups fdmanana
@ 2023-10-04 10:38 ` fdmanana
  2023-10-04 10:38 ` [PATCH 2/6] btrfs: add and use helpers for reading and writing log_transid fdmanana
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: fdmanana @ 2023-10-04 10:38 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Currently, the last_log_commit of a root can be accessed concurrently
without any lock protection. Readers can be calling btrfs_inode_in_log()
early in a fsync call, which reads a root's last_log_commit, while a
writer can change the last_log_commit while a log tree if being synced,
at btrfs_sync_log(). Any races here should be harmless, and in the worst
case they may cause a fsync to log an inode when it's not really needed,
so nothing bad from a functional perspective.

To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the last_log_commit field of a root using READ_ONCE() and
WRITE_ONCE(), and use these helpers everywhere.

[1] https://lwn.net/Articles/793253/

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/btrfs_inode.h |  2 +-
 fs/btrfs/ctree.h       | 16 +++++++++++++++-
 fs/btrfs/disk-io.c     |  4 ++--
 fs/btrfs/tree-log.c    |  4 ++--
 4 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 81bf514d988f..d32ef248828e 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -390,7 +390,7 @@ static inline bool btrfs_inode_in_log(struct btrfs_inode *inode, u64 generation)
 	spin_lock(&inode->lock);
 	if (inode->logged_trans == generation &&
 	    inode->last_sub_trans <= inode->last_log_commit &&
-	    inode->last_sub_trans <= inode->root->last_log_commit)
+	    inode->last_sub_trans <= btrfs_get_root_last_log_commit(inode->root))
 		ret = true;
 	spin_unlock(&inode->lock);
 	return ret;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 208a1888ca07..3ebb5229660a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -194,7 +194,11 @@ struct btrfs_root {
 	int log_transid;
 	/* No matter the commit succeeds or not*/
 	int log_transid_committed;
-	/* Just be updated when the commit succeeds. */
+	/*
+	 * Just be updated when the commit succeeds. Use
+	 * btrfs_get_root_last_log_commit() and btrfs_set_root_last_log_commit()
+	 * to access this field.
+	 */
 	int last_log_commit;
 	pid_t log_start_pid;
 
@@ -328,6 +332,16 @@ static inline u64 btrfs_root_id(const struct btrfs_root *root)
 	return root->root_key.objectid;
 }
 
+static inline int btrfs_get_root_last_log_commit(const struct btrfs_root *root)
+{
+	return READ_ONCE(root->last_log_commit);
+}
+
+static inline void btrfs_set_root_last_log_commit(struct btrfs_root *root, int commit_id)
+{
+	WRITE_ONCE(root->last_log_commit, commit_id);
+}
+
 /*
  * Structure that conveys information about an extent that is going to replace
  * all the extents in a file range.
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ff3802986b3e..fe18c54cec10 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -677,7 +677,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
 	atomic_set(&root->nr_swapfiles, 0);
 	root->log_transid = 0;
 	root->log_transid_committed = -1;
-	root->last_log_commit = 0;
+	btrfs_set_root_last_log_commit(root, 0);
 	root->anon_dev = 0;
 	if (!dummy) {
 		extent_io_tree_init(fs_info, &root->dirty_log_pages,
@@ -1006,7 +1006,7 @@ int btrfs_add_log_tree(struct btrfs_trans_handle *trans,
 	root->log_root = log_root;
 	root->log_transid = 0;
 	root->log_transid_committed = -1;
-	root->last_log_commit = 0;
+	btrfs_set_root_last_log_commit(root, 0);
 	return 0;
 }
 
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 2c4685316c43..28a61a7dd371 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3133,8 +3133,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	 * someone else already started it. We use <= and not < because the
 	 * first log transaction has an ID of 0.
 	 */
-	ASSERT(root->last_log_commit <= log_transid);
-	root->last_log_commit = log_transid;
+	ASSERT(btrfs_get_root_last_log_commit(root) <= log_transid);
+	btrfs_set_root_last_log_commit(root, log_transid);
 
 out_wake_log_root:
 	mutex_lock(&log_root_tree->log_mutex);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/6] btrfs: add and use helpers for reading and writing log_transid
  2023-10-04 10:38 [PATCH 0/6] btrfs: fix some data races during fsync and cleanups fdmanana
  2023-10-04 10:38 ` [PATCH 1/6] btrfs: add and use helpers for reading and writing last_log_commit fdmanana
@ 2023-10-04 10:38 ` fdmanana
  2023-10-04 10:38 ` [PATCH 3/6] btrfs: add and use helpers for reading and writing fs_info->generation fdmanana
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: fdmanana @ 2023-10-04 10:38 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Currently the log_transid field of a root is always modified while holding
the root's log_mutex locked. Most readers of a root's log_transid are also
holding the root's log_mutex locked, however there is one exception which
is btrfs_set_inode_last_trans() where we don't take the lock to avoid
blocking several operations if log syncing is happening in parallel.

Any races here should be harmless, and in the worst case they may cause a
fsync to log an inode when it's not really needed, so nothing bad from a
functional perspective.

To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the log_transid field of a root using READ_ONCE() and WRITE_ONCE(),
and use these helpers where needed.

[1] https://lwn.net/Articles/793253/

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ctree.h       | 18 ++++++++++++++++++
 fs/btrfs/disk-io.c     |  4 ++--
 fs/btrfs/transaction.h |  2 +-
 fs/btrfs/tree-log.c    |  2 +-
 4 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3ebb5229660a..c8f1d2d7c46c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -191,6 +191,14 @@ struct btrfs_root {
 	atomic_t log_commit[2];
 	/* Used only for log trees of subvolumes, not for the log root tree */
 	atomic_t log_batch;
+	/*
+	 * Protected by the 'log_mutex' lock but can be read without holding
+	 * that lock to avoid unnecessary lock contention, in which case it
+	 * should be read using btrfs_get_root_log_transid() except if it's a
+	 * log tree in which case it can be directly accessed. Updates to this
+	 * field should always use btrfs_set_root_log_transid(), except for log
+	 * trees where the field can be updated directly.
+	 */
 	int log_transid;
 	/* No matter the commit succeeds or not*/
 	int log_transid_committed;
@@ -332,6 +340,16 @@ static inline u64 btrfs_root_id(const struct btrfs_root *root)
 	return root->root_key.objectid;
 }
 
+static inline int btrfs_get_root_log_transid(const struct btrfs_root *root)
+{
+	return READ_ONCE(root->log_transid);
+}
+
+static inline void btrfs_set_root_log_transid(struct btrfs_root *root, int log_transid)
+{
+	WRITE_ONCE(root->log_transid, log_transid);
+}
+
 static inline int btrfs_get_root_last_log_commit(const struct btrfs_root *root)
 {
 	return READ_ONCE(root->last_log_commit);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index fe18c54cec10..c84d32951b26 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -675,7 +675,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
 	refcount_set(&root->refs, 1);
 	atomic_set(&root->snapshot_force_cow, 0);
 	atomic_set(&root->nr_swapfiles, 0);
-	root->log_transid = 0;
+	btrfs_set_root_log_transid(root, 0);
 	root->log_transid_committed = -1;
 	btrfs_set_root_last_log_commit(root, 0);
 	root->anon_dev = 0;
@@ -1004,7 +1004,7 @@ int btrfs_add_log_tree(struct btrfs_trans_handle *trans,
 
 	WARN_ON(root->log_root);
 	root->log_root = log_root;
-	root->log_transid = 0;
+	btrfs_set_root_log_transid(root, 0);
 	root->log_transid_committed = -1;
 	btrfs_set_root_last_log_commit(root, 0);
 	return 0;
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index f5bf3489d8c5..ff7580bbcf4e 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -175,7 +175,7 @@ static inline void btrfs_set_inode_last_trans(struct btrfs_trans_handle *trans,
 {
 	spin_lock(&inode->lock);
 	inode->last_trans = trans->transaction->transid;
-	inode->last_sub_trans = inode->root->log_transid;
+	inode->last_sub_trans = btrfs_get_root_log_transid(inode->root);
 	inode->last_log_commit = inode->last_sub_trans - 1;
 	spin_unlock(&inode->lock);
 }
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 28a61a7dd371..958bb23d3d99 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2958,7 +2958,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	btrfs_set_root_node(&log->root_item, log->node);
 	memcpy(&new_root_item, &log->root_item, sizeof(new_root_item));
 
-	root->log_transid++;
+	btrfs_set_root_log_transid(root, root->log_transid + 1);
 	log->log_transid = root->log_transid;
 	root->log_start_pid = 0;
 	/*
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 3/6] btrfs: add and use helpers for reading and writing fs_info->generation
  2023-10-04 10:38 [PATCH 0/6] btrfs: fix some data races during fsync and cleanups fdmanana
  2023-10-04 10:38 ` [PATCH 1/6] btrfs: add and use helpers for reading and writing last_log_commit fdmanana
  2023-10-04 10:38 ` [PATCH 2/6] btrfs: add and use helpers for reading and writing log_transid fdmanana
@ 2023-10-04 10:38 ` fdmanana
  2023-10-06 14:16   ` David Sterba
  2023-10-04 10:38 ` [PATCH 4/6] btrfs: add and use helpers for reading and writing last_trans_committed fdmanana
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 10+ messages in thread
From: fdmanana @ 2023-10-04 10:38 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Currently the generation field of struct btrfs_fs_info is always modified
while holding fs_info->trans_lock locked. Most readers will access this
field without taking that lock but while holding a transaction handle,
which is safe to do due to the transaction life cycle.

However there are other readers that are neither holding the lock nor
holding a transaction handle open:

1) When reading an inode from disk, at btrfs_read_locked_inode();

2) When reading the generation to expose it to sysfs, at
   btrfs_generation_show();

3) Early in the fsync path, at skip_inode_logging();

4) When creating a hole at btrfs_cont_expand(), during write paths,
   truncate and reflinking;

5) In the fs_info ioctl (btrfs_ioctl_fs_info());

6) While mounting the filesystem, in the open_ctree() path. In these
   cases it's safe to directly read fs_info->generation as no one
   can concurrently start a transaction and update fs_info->generation.

In case of the fsync path, races here should be harmless, and in the worst
case they may cause a fsync to log an inode when it's not really needed,
so nothing bad from a functional perspective. In the other cases it's not
so clear if functional problems may arise, though in case 1 rare things
like a load/store tearing [1] may cause the BTRFS_INODE_NEEDS_FULL_SYNC
flag not being set on an inode and therefore result in incorrect logging
later on in case a fsync call is made.

To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the generation field of struct btrfs_fs_info using READ_ONCE() and
WRITE_ONCE(), and use these helpers where needed.

[1] https://lwn.net/Articles/793253/

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/file.c        |  2 +-
 fs/btrfs/fs.h          | 16 ++++++++++++++++
 fs/btrfs/inode.c       |  4 ++--
 fs/btrfs/ioctl.c       |  2 +-
 fs/btrfs/sysfs.c       |  2 +-
 fs/btrfs/transaction.c |  2 +-
 6 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 004c53482f05..723f0c70452e 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1749,7 +1749,7 @@ static inline bool skip_inode_logging(const struct btrfs_log_ctx *ctx)
 	struct btrfs_inode *inode = BTRFS_I(ctx->inode);
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 
-	if (btrfs_inode_in_log(inode, fs_info->generation) &&
+	if (btrfs_inode_in_log(inode, btrfs_get_fs_generation(fs_info)) &&
 	    list_empty(&ctx->ordered_extents))
 		return true;
 
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 2bd9bedc7095..d04b729cbdf3 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -416,6 +416,12 @@ struct btrfs_fs_info {
 
 	struct btrfs_block_rsv empty_block_rsv;
 
+	/*
+	 * Updated while holding the lock 'trans_lock'. Due to the life cycle of
+	 * a transaction, it can be directly read while holding a transaction
+	 * handle, everywhere else must be read with btrfs_get_fs_generation().
+	 * Should always be updated using btrfs_set_fs_generation().
+	 */
 	u64 generation;
 	u64 last_trans_committed;
 	/*
@@ -817,6 +823,16 @@ struct btrfs_fs_info {
 #endif
 };
 
+static inline u64 btrfs_get_fs_generation(const struct btrfs_fs_info *fs_info)
+{
+	return READ_ONCE(fs_info->generation);
+}
+
+static inline void btrfs_set_fs_generation(struct btrfs_fs_info *fs_info, u64 gen)
+{
+	WRITE_ONCE(fs_info->generation, gen);
+}
+
 static inline void btrfs_set_last_root_drop_gen(struct btrfs_fs_info *fs_info,
 						u64 gen)
 {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3b3aec302c33..c9317c047587 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3800,7 +3800,7 @@ static int btrfs_read_locked_inode(struct inode *inode,
 	 * This is required for both inode re-read from disk and delayed inode
 	 * in delayed_nodes_tree.
 	 */
-	if (BTRFS_I(inode)->last_trans == fs_info->generation)
+	if (BTRFS_I(inode)->last_trans == btrfs_get_fs_generation(fs_info))
 		set_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
 			&BTRFS_I(inode)->runtime_flags);
 
@@ -4923,7 +4923,7 @@ int btrfs_cont_expand(struct btrfs_inode *inode, loff_t oldsize, loff_t size)
 			hole_em->orig_block_len = 0;
 			hole_em->ram_bytes = hole_size;
 			hole_em->compress_type = BTRFS_COMPRESS_NONE;
-			hole_em->generation = fs_info->generation;
+			hole_em->generation = btrfs_get_fs_generation(fs_info);
 
 			err = btrfs_replace_extent_map_range(inode, hole_em, true);
 			free_extent_map(hole_em);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 848b7e6f6421..7ab21283fae8 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2822,7 +2822,7 @@ static long btrfs_ioctl_fs_info(struct btrfs_fs_info *fs_info,
 	}
 
 	if (flags_in & BTRFS_FS_INFO_FLAG_GENERATION) {
-		fi_args->generation = fs_info->generation;
+		fi_args->generation = btrfs_get_fs_generation(fs_info);
 		fi_args->flags |= BTRFS_FS_INFO_FLAG_GENERATION;
 	}
 
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index e07be193323a..21ab8b9b62ce 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1201,7 +1201,7 @@ static ssize_t btrfs_generation_show(struct kobject *kobj,
 {
 	struct btrfs_fs_info *fs_info = to_fs_info(kobj);
 
-	return sysfs_emit(buf, "%llu\n", fs_info->generation);
+	return sysfs_emit(buf, "%llu\n", btrfs_get_fs_generation(fs_info));
 }
 BTRFS_ATTR(, generation, btrfs_generation_show);
 
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 3aa59cfa4ab0..f5db3a483f40 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -386,7 +386,7 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
 			IO_TREE_TRANS_DIRTY_PAGES);
 	extent_io_tree_init(fs_info, &cur_trans->pinned_extents,
 			IO_TREE_FS_PINNED_EXTENTS);
-	fs_info->generation++;
+	btrfs_set_fs_generation(fs_info, fs_info->generation + 1);
 	cur_trans->transid = fs_info->generation;
 	fs_info->running_transaction = cur_trans;
 	cur_trans->aborted = 0;
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 4/6] btrfs: add and use helpers for reading and writing last_trans_committed
  2023-10-04 10:38 [PATCH 0/6] btrfs: fix some data races during fsync and cleanups fdmanana
                   ` (2 preceding siblings ...)
  2023-10-04 10:38 ` [PATCH 3/6] btrfs: add and use helpers for reading and writing fs_info->generation fdmanana
@ 2023-10-04 10:38 ` fdmanana
  2023-10-04 10:38 ` [PATCH 5/6] btrfs: remove pointless barrier from btrfs_sync_file() fdmanana
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: fdmanana @ 2023-10-04 10:38 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Currently the last_trans_committed field of struct btrfs_fs_info is
modified and read without any locking or other protection. For example
early in the fsync path, skip_inode_logging() is called which reads
fs_info->last_trans_committed, but at the same time we can have a
transaction commit completing and updating that field.

In the case of an fsync this is harmless and any data race should be
rare and at most cause an unnecessary logging of an inode.

To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the last_commited_trans field of struct btrfs_fs_info using
READ_ONCE() and WRITE_ONCE(), and use these helpers everywhere.

[1] https://lwn.net/Articles/793253/

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/disk-io.c      |  9 +++++----
 fs/btrfs/file.c         |  2 +-
 fs/btrfs/fs.h           | 14 ++++++++++++++
 fs/btrfs/ioctl.c        |  2 +-
 fs/btrfs/scrub.c        |  2 +-
 fs/btrfs/super.c        |  7 ++++---
 fs/btrfs/transaction.c  |  6 +++---
 fs/btrfs/tree-checker.c |  2 +-
 8 files changed, 30 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c84d32951b26..401ea09ae4b8 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -244,6 +244,7 @@ blk_status_t btree_csum_one_bio(struct btrfs_bio *bbio)
 	struct extent_buffer *eb = bbio->private;
 	struct btrfs_fs_info *fs_info = eb->fs_info;
 	u64 found_start = btrfs_header_bytenr(eb);
+	u64 last_trans;
 	u8 result[BTRFS_CSUM_SIZE];
 	int ret;
 
@@ -281,12 +282,12 @@ blk_status_t btree_csum_one_bio(struct btrfs_bio *bbio)
 	 * Also check the generation, the eb reached here must be newer than
 	 * last committed. Or something seriously wrong happened.
 	 */
-	if (unlikely(btrfs_header_generation(eb) <= fs_info->last_trans_committed)) {
+	last_trans = btrfs_get_last_trans_committed(fs_info);
+	if (unlikely(btrfs_header_generation(eb) <= last_trans)) {
 		ret = -EUCLEAN;
 		btrfs_err(fs_info,
 			"block=%llu bad generation, have %llu expect > %llu",
-			  eb->start, btrfs_header_generation(eb),
-			  fs_info->last_trans_committed);
+			  eb->start, btrfs_header_generation(eb), last_trans);
 		goto error;
 	}
 	write_extent_buffer(eb, result, 0, fs_info->csum_size);
@@ -2653,7 +2654,7 @@ static int __cold init_tree_roots(struct btrfs_fs_info *fs_info)
 
 		/* All successful */
 		fs_info->generation = btrfs_header_generation(tree_root->node);
-		fs_info->last_trans_committed = fs_info->generation;
+		btrfs_set_last_trans_committed(fs_info, fs_info->generation);
 		fs_info->last_reloc_trans = 0;
 
 		/* Always begin writing backup roots after the one being used */
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 723f0c70452e..92e6f224bff9 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1760,7 +1760,7 @@ static inline bool skip_inode_logging(const struct btrfs_log_ctx *ctx)
 	 * and for a fast fsync we don't wait for that, we only wait for the
 	 * writeback to complete.
 	 */
-	if (inode->last_trans <= fs_info->last_trans_committed &&
+	if (inode->last_trans <= btrfs_get_last_trans_committed(fs_info) &&
 	    (test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags) ||
 	     list_empty(&ctx->ordered_extents)))
 		return true;
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index d04b729cbdf3..318df6f9d9cb 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -423,6 +423,10 @@ struct btrfs_fs_info {
 	 * Should always be updated using btrfs_set_fs_generation().
 	 */
 	u64 generation;
+	/*
+	 * Always use btrfs_get_last_trans_committed() and
+	 * btrfs_set_last_trans_committed() to read and update this field.
+	 */
 	u64 last_trans_committed;
 	/*
 	 * Generation of the last transaction used for block group relocation
@@ -833,6 +837,16 @@ static inline void btrfs_set_fs_generation(struct btrfs_fs_info *fs_info, u64 ge
 	WRITE_ONCE(fs_info->generation, gen);
 }
 
+static inline u64 btrfs_get_last_trans_committed(const struct btrfs_fs_info *fs_info)
+{
+	return READ_ONCE(fs_info->last_trans_committed);
+}
+
+static inline void btrfs_set_last_trans_committed(struct btrfs_fs_info *fs_info, u64 gen)
+{
+	WRITE_ONCE(fs_info->last_trans_committed, gen);
+}
+
 static inline void btrfs_set_last_root_drop_gen(struct btrfs_fs_info *fs_info,
 						u64 gen)
 {
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 7ab21283fae8..ab4ddbebeb16 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3131,7 +3131,7 @@ static noinline long btrfs_ioctl_start_sync(struct btrfs_root *root,
 			return PTR_ERR(trans);
 
 		/* No running transaction, don't bother */
-		transid = root->fs_info->last_trans_committed;
+		transid = btrfs_get_last_trans_committed(root->fs_info);
 		goto out;
 	}
 	transid = trans->transid;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index c477a14f1281..3e14f9cec19b 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2769,7 +2769,7 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx,
 	if (scrub_dev->fs_devices != fs_info->fs_devices)
 		gen = scrub_dev->generation;
 	else
-		gen = fs_info->last_trans_committed;
+		gen = btrfs_get_last_trans_committed(fs_info);
 
 	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
 		bytenr = btrfs_sb_offset(i);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 65883f5c488b..cc326969751f 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2221,6 +2221,7 @@ static int check_dev_super(struct btrfs_device *dev)
 {
 	struct btrfs_fs_info *fs_info = dev->fs_info;
 	struct btrfs_super_block *sb;
+	u64 last_trans;
 	u16 csum_type;
 	int ret = 0;
 
@@ -2256,10 +2257,10 @@ static int check_dev_super(struct btrfs_device *dev)
 	if (ret < 0)
 		goto out;
 
-	if (btrfs_super_generation(sb) != fs_info->last_trans_committed) {
+	last_trans = btrfs_get_last_trans_committed(fs_info);
+	if (btrfs_super_generation(sb) != last_trans) {
 		btrfs_err(fs_info, "transid mismatch, has %llu expect %llu",
-			btrfs_super_generation(sb),
-			fs_info->last_trans_committed);
+			  btrfs_super_generation(sb), last_trans);
 		ret = -EUCLEAN;
 		goto out;
 	}
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index f5db3a483f40..9694a3ca1739 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -980,7 +980,7 @@ int btrfs_wait_for_commit(struct btrfs_fs_info *fs_info, u64 transid)
 	int ret = 0;
 
 	if (transid) {
-		if (transid <= fs_info->last_trans_committed)
+		if (transid <= btrfs_get_last_trans_committed(fs_info))
 			goto out;
 
 		/* find specified transaction */
@@ -1004,7 +1004,7 @@ int btrfs_wait_for_commit(struct btrfs_fs_info *fs_info, u64 transid)
 		 * raced with btrfs_commit_transaction
 		 */
 		if (!cur_trans) {
-			if (transid > fs_info->last_trans_committed)
+			if (transid > btrfs_get_last_trans_committed(fs_info))
 				ret = -EINVAL;
 			goto out;
 		}
@@ -2587,7 +2587,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	if (test_bit(BTRFS_TRANS_HAVE_FREE_BGS, &cur_trans->flags))
 		btrfs_clear_space_info_full(fs_info);
 
-	fs_info->last_trans_committed = cur_trans->transid;
+	btrfs_set_last_trans_committed(fs_info, cur_trans->transid);
 	/*
 	 * We needn't acquire the lock here because there is no other task
 	 * which can change it.
diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index 1f2c389b0bfa..a416cbea75d1 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -2051,7 +2051,7 @@ int btrfs_verify_level_key(struct extent_buffer *eb, int level,
 	 * So we only checks tree blocks which is read from disk, whose
 	 * generation <= fs_info->last_trans_committed.
 	 */
-	if (btrfs_header_generation(eb) > fs_info->last_trans_committed)
+	if (btrfs_header_generation(eb) > btrfs_get_last_trans_committed(fs_info))
 		return 0;
 
 	/* We have @first_key, so this @eb must have at least one item */
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 5/6] btrfs: remove pointless barrier from btrfs_sync_file()
  2023-10-04 10:38 [PATCH 0/6] btrfs: fix some data races during fsync and cleanups fdmanana
                   ` (3 preceding siblings ...)
  2023-10-04 10:38 ` [PATCH 4/6] btrfs: add and use helpers for reading and writing last_trans_committed fdmanana
@ 2023-10-04 10:38 ` fdmanana
  2023-10-04 10:38 ` [PATCH 6/6] btrfs: update comment for struct btrfs_inode::lock fdmanana
  2023-10-06 14:29 ` [PATCH 0/6] btrfs: fix some data races during fsync and cleanups David Sterba
  6 siblings, 0 replies; 10+ messages in thread
From: fdmanana @ 2023-10-04 10:38 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

The memory barrier (smp_mb()) at btrfs_sync_file() is completely redundant
now that fs_info->last_trans_committed is read using READ_ONCE(), with the
helper btrfs_get_last_trans_committed(), and written using WRITE_ONCE()
with the helper btrfs_set_last_trans_committed().

This barrier was introduced in 2011, by commit a4abeea41adf ("Btrfs: kill
trans_mutex"), but even back then it was not correct since the writer side
(in btrfs_commit_transaction()), did not issue a pairing memory barrier
after it updated fs_info->last_trans_committed.

So remove this barrier.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/file.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 92e6f224bff9..92419cb8508a 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1889,7 +1889,6 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 
 	atomic_inc(&root->log_batch);
 
-	smp_mb();
 	if (skip_inode_logging(&ctx)) {
 		/*
 		 * We've had everything committed since the last time we were
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 6/6] btrfs: update comment for struct btrfs_inode::lock
  2023-10-04 10:38 [PATCH 0/6] btrfs: fix some data races during fsync and cleanups fdmanana
                   ` (4 preceding siblings ...)
  2023-10-04 10:38 ` [PATCH 5/6] btrfs: remove pointless barrier from btrfs_sync_file() fdmanana
@ 2023-10-04 10:38 ` fdmanana
  2023-10-06 14:29 ` [PATCH 0/6] btrfs: fix some data races during fsync and cleanups David Sterba
  6 siblings, 0 replies; 10+ messages in thread
From: fdmanana @ 2023-10-04 10:38 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Update the comment for the lock named "lock" in struct btrfs_inode because
it does not mention that the fields "delalloc_bytes", "defrag_bytes",
"csum_bytes", "outstanding_extents" and "disk_i_size" are also protected
by that lock.

Also add a comment on top of each field protected by this lock to mention
that the lock protects them.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/btrfs_inode.h | 32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index d32ef248828e..bebb5921b922 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -93,8 +93,9 @@ struct btrfs_inode {
 	/*
 	 * Lock for counters and all fields used to determine if the inode is in
 	 * the log or not (last_trans, last_sub_trans, last_log_commit,
-	 * logged_trans), to access/update new_delalloc_bytes and to update the
-	 * VFS' inode number of bytes used.
+	 * logged_trans), to access/update delalloc_bytes, new_delalloc_bytes,
+	 * defrag_bytes, disk_i_size, outstanding_extents, csum_bytes and to
+	 * update the VFS' inode number of bytes used.
 	 */
 	spinlock_t lock;
 
@@ -117,7 +118,7 @@ struct btrfs_inode {
 	 * Counters to keep track of the number of extent item's we may use due
 	 * to delalloc and such.  outstanding_extents is the number of extent
 	 * items we think we'll end up using, and reserved_extents is the number
-	 * of extent items we've reserved metadata for.
+	 * of extent items we've reserved metadata for. Protected by 'lock'.
 	 */
 	unsigned outstanding_extents;
 
@@ -143,28 +144,31 @@ struct btrfs_inode {
 	u64 generation;
 
 	/*
-	 * transid of the trans_handle that last modified this inode
+	 * ID of the transaction handle that last modified this inode.
+	 * Protected by 'lock'.
 	 */
 	u64 last_trans;
 
 	/*
-	 * transid that last logged this inode
+	 * ID of the transaction that last logged this inode.
+	 * Protected by 'lock'.
 	 */
 	u64 logged_trans;
 
 	/*
-	 * log transid when this inode was last modified
+	 * Log transaction ID when this inode was last modified.
+	 * Protected by 'lock'.
 	 */
 	int last_sub_trans;
 
-	/* a local copy of root's last_log_commit */
+	/* A local copy of root's last_log_commit. Protected by 'lock'. */
 	int last_log_commit;
 
 	union {
 		/*
 		 * Total number of bytes pending delalloc, used by stat to
 		 * calculate the real block usage of the file. This is used
-		 * only for files.
+		 * only for files. Protected by 'lock'.
 		 */
 		u64 delalloc_bytes;
 		/*
@@ -182,7 +186,7 @@ struct btrfs_inode {
 		 * Total number of bytes pending delalloc that fall within a file
 		 * range that is either a hole or beyond EOF (and no prealloc extent
 		 * exists in the range). This is always <= delalloc_bytes and this
-		 * is used only for files.
+		 * is used only for files. Protected by 'lock'.
 		 */
 		u64 new_delalloc_bytes;
 		/*
@@ -193,15 +197,15 @@ struct btrfs_inode {
 	};
 
 	/*
-	 * total number of bytes pending defrag, used by stat to check whether
-	 * it needs COW.
+	 * Total number of bytes pending defrag, used by stat to check whether
+	 * it needs COW. Protected by 'lock'.
 	 */
 	u64 defrag_bytes;
 
 	/*
-	 * the size of the file stored in the metadata on disk.  data=ordered
+	 * The size of the file stored in the metadata on disk.  data=ordered
 	 * means the in-memory i_size might be larger than the size on disk
-	 * because not all the blocks are written yet.
+	 * because not all the blocks are written yet. Protected by 'lock'.
 	 */
 	u64 disk_i_size;
 
@@ -235,7 +239,7 @@ struct btrfs_inode {
 
 	/*
 	 * Number of bytes outstanding that are going to need csums.  This is
-	 * used in ENOSPC accounting.
+	 * used in ENOSPC accounting. Protected by 'lock'.
 	 */
 	u64 csum_bytes;
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 3/6] btrfs: add and use helpers for reading and writing fs_info->generation
  2023-10-04 10:38 ` [PATCH 3/6] btrfs: add and use helpers for reading and writing fs_info->generation fdmanana
@ 2023-10-06 14:16   ` David Sterba
  2023-10-06 14:42     ` Filipe Manana
  0 siblings, 1 reply; 10+ messages in thread
From: David Sterba @ 2023-10-06 14:16 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Wed, Oct 04, 2023 at 11:38:50AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Currently the generation field of struct btrfs_fs_info is always modified
> while holding fs_info->trans_lock locked. Most readers will access this
> field without taking that lock but while holding a transaction handle,
> which is safe to do due to the transaction life cycle.
> 
> However there are other readers that are neither holding the lock nor
> holding a transaction handle open:
> 
> 1) When reading an inode from disk, at btrfs_read_locked_inode();
> 
> 2) When reading the generation to expose it to sysfs, at
>    btrfs_generation_show();
> 
> 3) Early in the fsync path, at skip_inode_logging();
> 
> 4) When creating a hole at btrfs_cont_expand(), during write paths,
>    truncate and reflinking;
> 
> 5) In the fs_info ioctl (btrfs_ioctl_fs_info());
> 
> 6) While mounting the filesystem, in the open_ctree() path. In these
>    cases it's safe to directly read fs_info->generation as no one
>    can concurrently start a transaction and update fs_info->generation.
> 
> In case of the fsync path, races here should be harmless, and in the worst
> case they may cause a fsync to log an inode when it's not really needed,
> so nothing bad from a functional perspective. In the other cases it's not
> so clear if functional problems may arise, though in case 1 rare things
> like a load/store tearing [1] may cause the BTRFS_INODE_NEEDS_FULL_SYNC
> flag not being set on an inode and therefore result in incorrect logging
> later on in case a fsync call is made.
> 
> To avoid data race warnings from tools like KCSAN and other issues such
> as load and store tearing (amongst others, see [1]), create helpers to
> access the generation field of struct btrfs_fs_info using READ_ONCE() and
> WRITE_ONCE(), and use these helpers where needed.
> 
> [1] https://lwn.net/Articles/793253/
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/file.c        |  2 +-
>  fs/btrfs/fs.h          | 16 ++++++++++++++++
>  fs/btrfs/inode.c       |  4 ++--
>  fs/btrfs/ioctl.c       |  2 +-
>  fs/btrfs/sysfs.c       |  2 +-
>  fs/btrfs/transaction.c |  2 +-
>  6 files changed, 22 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 004c53482f05..723f0c70452e 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1749,7 +1749,7 @@ static inline bool skip_inode_logging(const struct btrfs_log_ctx *ctx)
>  	struct btrfs_inode *inode = BTRFS_I(ctx->inode);
>  	struct btrfs_fs_info *fs_info = inode->root->fs_info;
>  
> -	if (btrfs_inode_in_log(inode, fs_info->generation) &&
> +	if (btrfs_inode_in_log(inode, btrfs_get_fs_generation(fs_info)) &&
>  	    list_empty(&ctx->ordered_extents))
>  		return true;
>  
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index 2bd9bedc7095..d04b729cbdf3 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -416,6 +416,12 @@ struct btrfs_fs_info {
>  
>  	struct btrfs_block_rsv empty_block_rsv;
>  
> +	/*
> +	 * Updated while holding the lock 'trans_lock'. Due to the life cycle of
> +	 * a transaction, it can be directly read while holding a transaction
> +	 * handle, everywhere else must be read with btrfs_get_fs_generation().
> +	 * Should always be updated using btrfs_set_fs_generation().
> +	 */
>  	u64 generation;
>  	u64 last_trans_committed;
>  	/*
> @@ -817,6 +823,16 @@ struct btrfs_fs_info {
>  #endif
>  };
>  
> +static inline u64 btrfs_get_fs_generation(const struct btrfs_fs_info *fs_info)
> +{
> +	return READ_ONCE(fs_info->generation);
> +}
> +
> +static inline void btrfs_set_fs_generation(struct btrfs_fs_info *fs_info, u64 gen)
> +{
> +	WRITE_ONCE(fs_info->generation, gen);
> +}
> +
>  static inline void btrfs_set_last_root_drop_gen(struct btrfs_fs_info *fs_info,
>  						u64 gen)
>  {
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 3b3aec302c33..c9317c047587 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3800,7 +3800,7 @@ static int btrfs_read_locked_inode(struct inode *inode,
>  	 * This is required for both inode re-read from disk and delayed inode
>  	 * in delayed_nodes_tree.
>  	 */
> -	if (BTRFS_I(inode)->last_trans == fs_info->generation)
> +	if (BTRFS_I(inode)->last_trans == btrfs_get_fs_generation(fs_info))
>  		set_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
>  			&BTRFS_I(inode)->runtime_flags);
>  
> @@ -4923,7 +4923,7 @@ int btrfs_cont_expand(struct btrfs_inode *inode, loff_t oldsize, loff_t size)
>  			hole_em->orig_block_len = 0;
>  			hole_em->ram_bytes = hole_size;
>  			hole_em->compress_type = BTRFS_COMPRESS_NONE;
> -			hole_em->generation = fs_info->generation;
> +			hole_em->generation = btrfs_get_fs_generation(fs_info);
>  
>  			err = btrfs_replace_extent_map_range(inode, hole_em, true);
>  			free_extent_map(hole_em);
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 848b7e6f6421..7ab21283fae8 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -2822,7 +2822,7 @@ static long btrfs_ioctl_fs_info(struct btrfs_fs_info *fs_info,
>  	}
>  
>  	if (flags_in & BTRFS_FS_INFO_FLAG_GENERATION) {
> -		fi_args->generation = fs_info->generation;
> +		fi_args->generation = btrfs_get_fs_generation(fs_info);
>  		fi_args->flags |= BTRFS_FS_INFO_FLAG_GENERATION;
>  	}
>  
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index e07be193323a..21ab8b9b62ce 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -1201,7 +1201,7 @@ static ssize_t btrfs_generation_show(struct kobject *kobj,
>  {
>  	struct btrfs_fs_info *fs_info = to_fs_info(kobj);
>  
> -	return sysfs_emit(buf, "%llu\n", fs_info->generation);
> +	return sysfs_emit(buf, "%llu\n", btrfs_get_fs_generation(fs_info));
>  }
>  BTRFS_ATTR(, generation, btrfs_generation_show);
>  
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 3aa59cfa4ab0..f5db3a483f40 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -386,7 +386,7 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
>  			IO_TREE_TRANS_DIRTY_PAGES);
>  	extent_io_tree_init(fs_info, &cur_trans->pinned_extents,
>  			IO_TREE_FS_PINNED_EXTENTS);
> -	fs_info->generation++;
> +	btrfs_set_fs_generation(fs_info, fs_info->generation + 1);

Should this also use the helper for the part where it reads the value?

	btrfs_set_fs_generation(fs_info, btrfs_get_fs_generation(fs_info) + 1);

It's a matter of annotation not a functional fix, I don't know if KCSAN
would warn here. I'd expect that the unprotected access uses the helpers
consistently, ie. both or none.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/6] btrfs: fix some data races during fsync and cleanups
  2023-10-04 10:38 [PATCH 0/6] btrfs: fix some data races during fsync and cleanups fdmanana
                   ` (5 preceding siblings ...)
  2023-10-04 10:38 ` [PATCH 6/6] btrfs: update comment for struct btrfs_inode::lock fdmanana
@ 2023-10-06 14:29 ` David Sterba
  6 siblings, 0 replies; 10+ messages in thread
From: David Sterba @ 2023-10-06 14:29 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Wed, Oct 04, 2023 at 11:38:47AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> The following remove some data races affecting mostly fsync. In general
> these are mostly harmless from a functional perspective, though there are
> a few cases that could cause some unexpected results but should be very
> rare to hit. There's also a couple cleanups here.
> More details on the changelogs.
> 
> Filipe Manana (6):
>   btrfs: add and use helpers for reading and writing last_log_commit
>   btrfs: add and use helpers for reading and writing log_transid
>   btrfs: add and use helpers for reading and writing fs_info->generation
>   btrfs: add and use helpers for reading and writing last_trans_committed
>   btrfs: remove pointless barrier from btrfs_sync_file()
>   btrfs: update comment for struct btrfs_inode::lock

Added to misc-next as the changes are straightforward, if there's any
fixup needed for the increments I'll do that in the commits after we
agree. Thanks.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 3/6] btrfs: add and use helpers for reading and writing fs_info->generation
  2023-10-06 14:16   ` David Sterba
@ 2023-10-06 14:42     ` Filipe Manana
  0 siblings, 0 replies; 10+ messages in thread
From: Filipe Manana @ 2023-10-06 14:42 UTC (permalink / raw)
  To: dsterba; +Cc: linux-btrfs

On Fri, Oct 6, 2023 at 3:23 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Wed, Oct 04, 2023 at 11:38:50AM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Currently the generation field of struct btrfs_fs_info is always modified
> > while holding fs_info->trans_lock locked. Most readers will access this
> > field without taking that lock but while holding a transaction handle,
> > which is safe to do due to the transaction life cycle.
> >
> > However there are other readers that are neither holding the lock nor
> > holding a transaction handle open:
> >
> > 1) When reading an inode from disk, at btrfs_read_locked_inode();
> >
> > 2) When reading the generation to expose it to sysfs, at
> >    btrfs_generation_show();
> >
> > 3) Early in the fsync path, at skip_inode_logging();
> >
> > 4) When creating a hole at btrfs_cont_expand(), during write paths,
> >    truncate and reflinking;
> >
> > 5) In the fs_info ioctl (btrfs_ioctl_fs_info());
> >
> > 6) While mounting the filesystem, in the open_ctree() path. In these
> >    cases it's safe to directly read fs_info->generation as no one
> >    can concurrently start a transaction and update fs_info->generation.
> >
> > In case of the fsync path, races here should be harmless, and in the worst
> > case they may cause a fsync to log an inode when it's not really needed,
> > so nothing bad from a functional perspective. In the other cases it's not
> > so clear if functional problems may arise, though in case 1 rare things
> > like a load/store tearing [1] may cause the BTRFS_INODE_NEEDS_FULL_SYNC
> > flag not being set on an inode and therefore result in incorrect logging
> > later on in case a fsync call is made.
> >
> > To avoid data race warnings from tools like KCSAN and other issues such
> > as load and store tearing (amongst others, see [1]), create helpers to
> > access the generation field of struct btrfs_fs_info using READ_ONCE() and
> > WRITE_ONCE(), and use these helpers where needed.
> >
> > [1] https://lwn.net/Articles/793253/
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >  fs/btrfs/file.c        |  2 +-
> >  fs/btrfs/fs.h          | 16 ++++++++++++++++
> >  fs/btrfs/inode.c       |  4 ++--
> >  fs/btrfs/ioctl.c       |  2 +-
> >  fs/btrfs/sysfs.c       |  2 +-
> >  fs/btrfs/transaction.c |  2 +-
> >  6 files changed, 22 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 004c53482f05..723f0c70452e 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -1749,7 +1749,7 @@ static inline bool skip_inode_logging(const struct btrfs_log_ctx *ctx)
> >       struct btrfs_inode *inode = BTRFS_I(ctx->inode);
> >       struct btrfs_fs_info *fs_info = inode->root->fs_info;
> >
> > -     if (btrfs_inode_in_log(inode, fs_info->generation) &&
> > +     if (btrfs_inode_in_log(inode, btrfs_get_fs_generation(fs_info)) &&
> >           list_empty(&ctx->ordered_extents))
> >               return true;
> >
> > diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> > index 2bd9bedc7095..d04b729cbdf3 100644
> > --- a/fs/btrfs/fs.h
> > +++ b/fs/btrfs/fs.h
> > @@ -416,6 +416,12 @@ struct btrfs_fs_info {
> >
> >       struct btrfs_block_rsv empty_block_rsv;
> >
> > +     /*
> > +      * Updated while holding the lock 'trans_lock'. Due to the life cycle of
> > +      * a transaction, it can be directly read while holding a transaction
> > +      * handle, everywhere else must be read with btrfs_get_fs_generation().
> > +      * Should always be updated using btrfs_set_fs_generation().
> > +      */
> >       u64 generation;
> >       u64 last_trans_committed;
> >       /*
> > @@ -817,6 +823,16 @@ struct btrfs_fs_info {
> >  #endif
> >  };
> >
> > +static inline u64 btrfs_get_fs_generation(const struct btrfs_fs_info *fs_info)
> > +{
> > +     return READ_ONCE(fs_info->generation);
> > +}
> > +
> > +static inline void btrfs_set_fs_generation(struct btrfs_fs_info *fs_info, u64 gen)
> > +{
> > +     WRITE_ONCE(fs_info->generation, gen);
> > +}
> > +
> >  static inline void btrfs_set_last_root_drop_gen(struct btrfs_fs_info *fs_info,
> >                                               u64 gen)
> >  {
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 3b3aec302c33..c9317c047587 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -3800,7 +3800,7 @@ static int btrfs_read_locked_inode(struct inode *inode,
> >        * This is required for both inode re-read from disk and delayed inode
> >        * in delayed_nodes_tree.
> >        */
> > -     if (BTRFS_I(inode)->last_trans == fs_info->generation)
> > +     if (BTRFS_I(inode)->last_trans == btrfs_get_fs_generation(fs_info))
> >               set_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
> >                       &BTRFS_I(inode)->runtime_flags);
> >
> > @@ -4923,7 +4923,7 @@ int btrfs_cont_expand(struct btrfs_inode *inode, loff_t oldsize, loff_t size)
> >                       hole_em->orig_block_len = 0;
> >                       hole_em->ram_bytes = hole_size;
> >                       hole_em->compress_type = BTRFS_COMPRESS_NONE;
> > -                     hole_em->generation = fs_info->generation;
> > +                     hole_em->generation = btrfs_get_fs_generation(fs_info);
> >
> >                       err = btrfs_replace_extent_map_range(inode, hole_em, true);
> >                       free_extent_map(hole_em);
> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > index 848b7e6f6421..7ab21283fae8 100644
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -2822,7 +2822,7 @@ static long btrfs_ioctl_fs_info(struct btrfs_fs_info *fs_info,
> >       }
> >
> >       if (flags_in & BTRFS_FS_INFO_FLAG_GENERATION) {
> > -             fi_args->generation = fs_info->generation;
> > +             fi_args->generation = btrfs_get_fs_generation(fs_info);
> >               fi_args->flags |= BTRFS_FS_INFO_FLAG_GENERATION;
> >       }
> >
> > diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> > index e07be193323a..21ab8b9b62ce 100644
> > --- a/fs/btrfs/sysfs.c
> > +++ b/fs/btrfs/sysfs.c
> > @@ -1201,7 +1201,7 @@ static ssize_t btrfs_generation_show(struct kobject *kobj,
> >  {
> >       struct btrfs_fs_info *fs_info = to_fs_info(kobj);
> >
> > -     return sysfs_emit(buf, "%llu\n", fs_info->generation);
> > +     return sysfs_emit(buf, "%llu\n", btrfs_get_fs_generation(fs_info));
> >  }
> >  BTRFS_ATTR(, generation, btrfs_generation_show);
> >
> > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> > index 3aa59cfa4ab0..f5db3a483f40 100644
> > --- a/fs/btrfs/transaction.c
> > +++ b/fs/btrfs/transaction.c
> > @@ -386,7 +386,7 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
> >                       IO_TREE_TRANS_DIRTY_PAGES);
> >       extent_io_tree_init(fs_info, &cur_trans->pinned_extents,
> >                       IO_TREE_FS_PINNED_EXTENTS);
> > -     fs_info->generation++;
> > +     btrfs_set_fs_generation(fs_info, fs_info->generation + 1);
>
> Should this also use the helper for the part where it reads the value?
>
>         btrfs_set_fs_generation(fs_info, btrfs_get_fs_generation(fs_info) + 1);

Nop. As mentioned in the comment added over the field in the struct, the field
is only updated while holding fs_info->trans_lock - in fact this is
the only place
we update the field. So it is not necessary to use the helper to read here.

>
> It's a matter of annotation not a functional fix, I don't know if KCSAN
> would warn here. I'd expect that the unprotected access uses the helpers
> consistently, ie. both or none.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2023-10-06 14:43 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-04 10:38 [PATCH 0/6] btrfs: fix some data races during fsync and cleanups fdmanana
2023-10-04 10:38 ` [PATCH 1/6] btrfs: add and use helpers for reading and writing last_log_commit fdmanana
2023-10-04 10:38 ` [PATCH 2/6] btrfs: add and use helpers for reading and writing log_transid fdmanana
2023-10-04 10:38 ` [PATCH 3/6] btrfs: add and use helpers for reading and writing fs_info->generation fdmanana
2023-10-06 14:16   ` David Sterba
2023-10-06 14:42     ` Filipe Manana
2023-10-04 10:38 ` [PATCH 4/6] btrfs: add and use helpers for reading and writing last_trans_committed fdmanana
2023-10-04 10:38 ` [PATCH 5/6] btrfs: remove pointless barrier from btrfs_sync_file() fdmanana
2023-10-04 10:38 ` [PATCH 6/6] btrfs: update comment for struct btrfs_inode::lock fdmanana
2023-10-06 14:29 ` [PATCH 0/6] btrfs: fix some data races during fsync and cleanups David Sterba

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).