All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC 0/9] btrfs: simple quotas
@ 2023-05-03  0:59 Boris Burkov
  2023-05-03  0:59 ` [PATCH 1/9] btrfs: simple quotas mode Boris Burkov
                   ` (9 more replies)
  0 siblings, 10 replies; 23+ messages in thread
From: Boris Burkov @ 2023-05-03  0:59 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

btrfs quota groups (qgroups) are a compelling feature of btrfs that
allow flexible control for limiting subvolume data and metadata usage.
However, due to btrfs's high level decision to tradeoff snapshot
performance against ref-counting performance, qgroups suffer from
non-trivial performance issues that make them unattractive in certain
workloads. Particularly, frequent backref walking during writes and
during commits can make operations increasingly expensive as the number
of snapshots scales up. For that reason, we have never been able to
commit to using qgroups in production at Meta, despite significant
interest from people running container workloads, where we would benefit
from protecting the rest of the host from a buggy application in a
container running away with disk usage.

This patch series introduces a simplified version of qgroups called
simple quotas (squotas) which never computes global reference counts
for extents, and thus has similar performance characteristics to normal,
quotas disabled, btrfs. The "trick" is that in simple quotas mode, we
account all extents permanently to the subvolume in which they were
originally created. That allows us to make all accounting 1:1 with
extent item lifetime, removing the need to walk backrefs. However, this
sacrifices the ability to compute shared vs. exclusive usage. It also
results in counter-intuitive, though still predictable and simple,
accounting in the cases where an original extent is removed while a
shared copy still exists. Qgroups is able to detect that case and count
the remaining copy as an exclusive owner, while squotas is not. As a
result, squotas works best when the original extent is immutable and
outlives any clones.

In order to track the original creating subvolume of a data extent in
the face of reflinks, it is necessary to add additional accounting to
the extent item. To save space, this is done with a new inline ref item.
However, the downside of this approach is that it makes enabling squota
an incompat change, denoted by the new incompat bit SIMPLE_QUOTA. When
this bit is set and quotas are enabled, new extent items get the extra
accounting, and freed extent items check for the accounting to find
their creating subvolume.

Squotas reuses the api of qgroups. The only difference is that when you
enable quotas via `btrfs quota enable`, you pass the `--simple` flag.
Squotas will always report exclusive == shared for each qgroup.

This is still a preliminary RFC patch series, so not all the ducks are
fully in a row. In particular, some userspace parts are missing, like
meaningful integration with fsck, which will drive further testing.

My current branches for btrfs-progs and fstests do contain some (sloppy)
minimal support needed to run and test the feature:
btrfs-progs: https://github.com/boryas/btrfs-progs/tree/squota-progs
fstests: https://github.com/boryas/fstests/tree/squota-test

Current testing methodology:
- New fstest (btrfs/400 in the squota-test branch)
- Run all fstests with squota enabled at mkfs time. Not all tests are
  passing in this regime, though this is actually true of qgroups as
  well. Most of the issues have to do with leaking reserved space in
  less commonly tested cases like I/O failures. My intent is to get this
  test suite fully passing.
- Run all fstests without squota enabled at mkfs time

Basic performance test:
In this test, I ran a workload which generated K files in a subvolume,
then took L snapshots of that subvolume, then unshared each file in
each subvolume. The measurement is just total walltime. K is the row
index and L the column index, so in these tables, we vary between 1
and 100 files and 1 and 10000 snapshots. The "n" table is no quotas,
the "q" table is qgroups and the "s" table is squotas. As you can see,
"n" and "s" are quite similar, while "q" falls of a cliff as the
number of snapshots increases. More sophisticated and realistic
performance testing that doesn't abuse such an insane number of
snapshots is still to come.

n
        1       10      100     1000    10000
1       0.18    0.24    1.58    16.49   211.34
10      0.28    0.43    2.80    29.74   324.70
100     0.55    0.99    6.57    65.13   717.51

q
        1       10      100     1000    10000
1       2.19    0.35    2.32    25.78   756.62
10      0.34    0.48    3.24    68.72   3731.73
100     0.64    0.80    7.63    215.54  14170.73

s
        1       10      100     1000    10000
1       2.19    0.32    1.83    19.19   231.75
10      0.31    0.43    2.86    28.86   351.42
100     0.70    0.90    6.75    67.89   742.93


Boris Burkov (9):
  btrfs: simple quotas mode
  btrfs: new function for recording simple quota usage
  btrfs: track original extent subvol in a new inline ref
  btrfs: track metadata owning root in delayed refs
  btrfs: record simple quota deltas
  btrfs: auto hierarchy for simple qgroups of nested subvols
  btrfs: check generation when recording simple quota delta
  btrfs: expose the qgroup mode via sysfs
  btrfs: free qgroup rsv on io failure

 fs/btrfs/accessors.h            |   6 +
 fs/btrfs/backref.c              |   3 +
 fs/btrfs/delayed-ref.c          |  13 +-
 fs/btrfs/delayed-ref.h          |  28 ++++-
 fs/btrfs/extent-tree.c          | 143 +++++++++++++++++----
 fs/btrfs/fs.h                   |   7 +-
 fs/btrfs/ioctl.c                |   4 +-
 fs/btrfs/ordered-data.c         |   6 +-
 fs/btrfs/print-tree.c           |  12 ++
 fs/btrfs/qgroup.c               | 216 +++++++++++++++++++++++++++++---
 fs/btrfs/qgroup.h               |  29 ++++-
 fs/btrfs/ref-verify.c           |   3 +
 fs/btrfs/relocation.c           |  11 +-
 fs/btrfs/sysfs.c                |  26 ++++
 fs/btrfs/transaction.c          |  11 +-
 fs/btrfs/tree-checker.c         |   3 +
 include/uapi/linux/btrfs.h      |   1 +
 include/uapi/linux/btrfs_tree.h |  13 ++
 18 files changed, 471 insertions(+), 64 deletions(-)

-- 
2.40.0


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/9] btrfs: simple quotas mode
  2023-05-03  0:59 [PATCH RFC 0/9] btrfs: simple quotas Boris Burkov
@ 2023-05-03  0:59 ` Boris Burkov
  2023-05-03  2:38   ` Qu Wenruo
  2023-05-03  0:59 ` [PATCH 2/9] btrfs: new function for recording simple quota usage Boris Burkov
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 23+ messages in thread
From: Boris Burkov @ 2023-05-03  0:59 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Allow the quota enable ioctl to specify simple quotas. Set an INCOMPAT
bit when simple quotas are enabled, as it will result in several
breaking changes to the on-disk format.

Introduce an enum for capturing the current quota mode. Rather than just
enabled/disabled, the possible settings are now disabled/simple/full.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/fs.h              |  5 ++-
 fs/btrfs/ioctl.c           |  2 +-
 fs/btrfs/qgroup.c          | 68 ++++++++++++++++++++++++++++++--------
 fs/btrfs/qgroup.h          | 10 +++++-
 fs/btrfs/transaction.c     |  4 +--
 include/uapi/linux/btrfs.h |  1 +
 6 files changed, 69 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 0d98fc5f6f44..6c989d87768c 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -218,7 +218,8 @@ enum {
 	 BTRFS_FEATURE_INCOMPAT_NO_HOLES	|	\
 	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID	|	\
 	 BTRFS_FEATURE_INCOMPAT_RAID1C34	|	\
-	 BTRFS_FEATURE_INCOMPAT_ZONED)
+	 BTRFS_FEATURE_INCOMPAT_ZONED		|	\
+	 BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA)
 
 #ifdef CONFIG_BTRFS_DEBUG
 	/*
@@ -233,7 +234,6 @@ enum {
 
 #define BTRFS_FEATURE_INCOMPAT_SUPP		\
 	(BTRFS_FEATURE_INCOMPAT_SUPP_STABLE)
-
 #endif
 
 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET			\
@@ -791,7 +791,6 @@ struct btrfs_fs_info {
 	struct lockdep_map btrfs_state_change_map[4];
 	struct lockdep_map btrfs_trans_pending_ordered_map;
 	struct lockdep_map btrfs_ordered_extent_map;
-
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	spinlock_t ref_verify_lock;
 	struct rb_root block_tree;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 9522669000a7..ca7d2ef739c8 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3685,7 +3685,7 @@ static long btrfs_ioctl_quota_ctl(struct file *file, void __user *arg)
 
 	switch (sa->cmd) {
 	case BTRFS_QUOTA_CTL_ENABLE:
-		ret = btrfs_quota_enable(fs_info);
+		ret = btrfs_quota_enable(fs_info, sa);
 		break;
 	case BTRFS_QUOTA_CTL_DISABLE:
 		ret = btrfs_quota_disable(fs_info);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index f41da7ac360d..3c8b296215ee 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -3,6 +3,7 @@
  * Copyright (C) 2011 STRATO.  All rights reserved.
  */
 
+#include <linux/btrfs.h>
 #include <linux/sched.h>
 #include <linux/pagemap.h>
 #include <linux/writeback.h>
@@ -10,7 +11,6 @@
 #include <linux/rbtree.h>
 #include <linux/slab.h>
 #include <linux/workqueue.h>
-#include <linux/btrfs.h>
 #include <linux/sched/mm.h>
 
 #include "ctree.h"
@@ -30,6 +30,15 @@
 #include "root-tree.h"
 #include "tree-checker.h"
 
+enum btrfs_qgroup_mode btrfs_qgroup_mode(struct btrfs_fs_info *fs_info)
+{
+	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
+		return BTRFS_QGROUP_MODE_DISABLED;
+	if (btrfs_fs_incompat(fs_info, SIMPLE_QUOTA))
+		return BTRFS_QGROUP_MODE_SIMPLE;
+	return BTRFS_QGROUP_MODE_FULL;
+}
+
 /*
  * Helpers to access qgroup reservation
  *
@@ -340,6 +349,8 @@ int btrfs_verify_qgroup_counts(struct btrfs_fs_info *fs_info, u64 qgroupid,
 
 static void qgroup_mark_inconsistent(struct btrfs_fs_info *fs_info)
 {
+	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
+		return;
 	fs_info->qgroup_flags |= (BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT |
 				  BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCAN |
 				  BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING);
@@ -412,7 +423,8 @@ int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info)
 				goto out;
 			}
 			if (btrfs_qgroup_status_generation(l, ptr) !=
-			    fs_info->generation) {
+			    fs_info->generation &&
+			    !btrfs_fs_incompat(fs_info, SIMPLE_QUOTA)) {
 				qgroup_mark_inconsistent(fs_info);
 				btrfs_err(fs_info,
 					"qgroup generation mismatch, marked as inconsistent");
@@ -949,7 +961,8 @@ static int btrfs_clean_quota_tree(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
-int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
+int btrfs_quota_enable(struct btrfs_fs_info *fs_info,
+		       struct btrfs_ioctl_quota_ctl_args *quota_ctl_args)
 {
 	struct btrfs_root *quota_root;
 	struct btrfs_root *tree_root = fs_info->tree_root;
@@ -961,6 +974,7 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
 	struct btrfs_qgroup *qgroup = NULL;
 	struct btrfs_trans_handle *trans = NULL;
 	struct ulist *ulist = NULL;
+	bool simple_qgroups = quota_ctl_args->status == 42;
 	int ret = 0;
 	int slot;
 
@@ -1063,8 +1077,9 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
 				 struct btrfs_qgroup_status_item);
 	btrfs_set_qgroup_status_generation(leaf, ptr, trans->transid);
 	btrfs_set_qgroup_status_version(leaf, ptr, BTRFS_QGROUP_STATUS_VERSION);
-	fs_info->qgroup_flags = BTRFS_QGROUP_STATUS_FLAG_ON |
-				BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
+	fs_info->qgroup_flags = BTRFS_QGROUP_STATUS_FLAG_ON;
+	if (!simple_qgroups)
+		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
 	btrfs_set_qgroup_status_flags(leaf, ptr, fs_info->qgroup_flags &
 				      BTRFS_QGROUP_STATUS_FLAGS_MASK);
 	btrfs_set_qgroup_status_rescan(leaf, ptr, 0);
@@ -1180,8 +1195,14 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
 	spin_lock(&fs_info->qgroup_lock);
 	fs_info->quota_root = quota_root;
 	set_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);
+	if (simple_qgroups)
+		btrfs_set_fs_incompat(fs_info, SIMPLE_QUOTA);
 	spin_unlock(&fs_info->qgroup_lock);
 
+	/* Skip rescan for simple qgroups */
+	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
+		goto out_free_path;
+
 	ret = qgroup_rescan_init(fs_info, 0, 1);
 	if (!ret) {
 	        qgroup_rescan_zero_tracking(fs_info);
@@ -1766,6 +1787,9 @@ int btrfs_qgroup_trace_extent_nolock(struct btrfs_fs_info *fs_info,
 	struct btrfs_qgroup_extent_record *entry;
 	u64 bytenr = record->bytenr;
 
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
+		return 0;
+
 	lockdep_assert_held(&delayed_refs->lock);
 	trace_btrfs_qgroup_trace_extent(fs_info, record);
 
@@ -1798,6 +1822,8 @@ int btrfs_qgroup_trace_extent_post(struct btrfs_trans_handle *trans,
 	struct btrfs_backref_walk_ctx ctx = { 0 };
 	int ret;
 
+	if (btrfs_qgroup_mode(trans->fs_info) != BTRFS_QGROUP_MODE_FULL)
+		return 0;
 	/*
 	 * We are always called in a context where we are already holding a
 	 * transaction handle. Often we are called when adding a data delayed
@@ -1853,7 +1879,7 @@ int btrfs_qgroup_trace_extent(struct btrfs_trans_handle *trans, u64 bytenr,
 	struct btrfs_delayed_ref_root *delayed_refs;
 	int ret;
 
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL
 	    || bytenr == 0 || num_bytes == 0)
 		return 0;
 	record = kzalloc(sizeof(*record), GFP_NOFS);
@@ -1886,7 +1912,7 @@ int btrfs_qgroup_trace_leaf_items(struct btrfs_trans_handle *trans,
 	u64 bytenr, num_bytes;
 
 	/* We can be called directly from walk_up_proc() */
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
 		return 0;
 
 	for (i = 0; i < nr; i++) {
@@ -2262,7 +2288,7 @@ static int qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
 	int level;
 	int ret;
 
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
 		return 0;
 
 	/* Wrong parameter order */
@@ -2319,7 +2345,7 @@ int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans,
 	BUG_ON(root_level < 0 || root_level >= BTRFS_MAX_LEVEL);
 	BUG_ON(root_eb == NULL);
 
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
 		return 0;
 
 	spin_lock(&fs_info->qgroup_lock);
@@ -2659,7 +2685,7 @@ int btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans, u64 bytenr,
 	 * If quotas get disabled meanwhile, the resources need to be freed and
 	 * we can't just exit here.
 	 */
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) ||
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL ||
 	    fs_info->qgroup_flags & BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING)
 		goto out_free;
 
@@ -2747,6 +2773,9 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans)
 	u64 qgroup_to_skip;
 	int ret = 0;
 
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
+		return 0;
+
 	delayed_refs = &trans->transaction->delayed_refs;
 	qgroup_to_skip = delayed_refs->qgroup_to_skip;
 	while ((node = rb_first(&delayed_refs->dirty_extent_root))) {
@@ -2989,11 +3018,10 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
 		qgroup_dirty(fs_info, dstgroup);
 	}
 
-	if (srcid) {
+	if (srcid && btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_FULL) {
 		srcgroup = find_qgroup_rb(fs_info, srcid);
 		if (!srcgroup)
 			goto unlock;
-
 		/*
 		 * We call inherit after we clone the root in order to make sure
 		 * our counts don't go crazy, so at this point the only
@@ -3281,6 +3309,9 @@ static int qgroup_rescan_leaf(struct btrfs_trans_handle *trans,
 	int slot;
 	int ret;
 
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
+		return 1;
+
 	mutex_lock(&fs_info->qgroup_rescan_lock);
 	extent_root = btrfs_extent_root(fs_info,
 				fs_info->qgroup_rescan_progress.objectid);
@@ -3378,6 +3409,9 @@ static void btrfs_qgroup_rescan_worker(struct btrfs_work *work)
 	bool stopped = false;
 	bool did_leaf_rescans = false;
 
+	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
+		return;
+
 	path = btrfs_alloc_path();
 	if (!path)
 		goto out;
@@ -3481,6 +3515,12 @@ qgroup_rescan_init(struct btrfs_fs_info *fs_info, u64 progress_objectid,
 {
 	int ret = 0;
 
+	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE) {
+		btrfs_warn(fs_info, "qgroup rescan init failed, running in simple mode. mode: %d\n",
+			btrfs_qgroup_mode(fs_info));
+		return -EINVAL;
+	}
+
 	if (!init_flags) {
 		/* we're resuming qgroup rescan at mount time */
 		if (!(fs_info->qgroup_flags &
@@ -4240,7 +4280,7 @@ int btrfs_qgroup_add_swapped_blocks(struct btrfs_trans_handle *trans,
 	int level = btrfs_header_level(subvol_parent) - 1;
 	int ret = 0;
 
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
 		return 0;
 
 	if (btrfs_node_ptr_generation(subvol_parent, subvol_slot) >
@@ -4350,7 +4390,7 @@ int btrfs_qgroup_trace_subtree_after_cow(struct btrfs_trans_handle *trans,
 	int ret = 0;
 	int i;
 
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
 		return 0;
 	if (!is_fstree(root->root_key.objectid) || !root->reloc_root)
 		return 0;
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 7bffa10589d6..d4c4d039585f 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -249,7 +249,15 @@ enum {
 	ENUM_BIT(QGROUP_FREE),
 };
 
-int btrfs_quota_enable(struct btrfs_fs_info *fs_info);
+enum btrfs_qgroup_mode {
+	BTRFS_QGROUP_MODE_DISABLED,
+	BTRFS_QGROUP_MODE_FULL,
+	BTRFS_QGROUP_MODE_SIMPLE
+};
+
+enum btrfs_qgroup_mode btrfs_qgroup_mode(struct btrfs_fs_info *fs_info);
+int btrfs_quota_enable(struct btrfs_fs_info *fs_info,
+		       struct btrfs_ioctl_quota_ctl_args *quota_ctl_args);
 int btrfs_quota_disable(struct btrfs_fs_info *fs_info);
 int btrfs_qgroup_rescan(struct btrfs_fs_info *fs_info);
 void btrfs_qgroup_rescan_resume(struct btrfs_fs_info *fs_info);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 8b6a99b8d7f6..e6d6752c2fca 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1514,11 +1514,11 @@ static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
 	int ret;
 
 	/*
-	 * Save some performance in the case that qgroups are not
+	 * Save some performance in the case that full qgroups are not
 	 * enabled. If this check races with the ioctl, rescan will
 	 * kick in anyway.
 	 */
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
 		return 0;
 
 	/*
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index dbb8b96da50d..957ca4037974 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -333,6 +333,7 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_INCOMPAT_RAID1C34		(1ULL << 11)
 #define BTRFS_FEATURE_INCOMPAT_ZONED		(1ULL << 12)
 #define BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2	(1ULL << 13)
+#define BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA	(1ULL << 14)
 
 struct btrfs_ioctl_feature_flags {
 	__u64 compat_flags;
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/9] btrfs: new function for recording simple quota usage
  2023-05-03  0:59 [PATCH RFC 0/9] btrfs: simple quotas Boris Burkov
  2023-05-03  0:59 ` [PATCH 1/9] btrfs: simple quotas mode Boris Burkov
@ 2023-05-03  0:59 ` Boris Burkov
  2023-05-03  0:59 ` [PATCH 3/9] btrfs: track original extent subvol in a new inline ref Boris Burkov
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Boris Burkov @ 2023-05-03  0:59 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Rather than re-computing shared/exclusive ownership based on backrefs
and walking roots for implicit backrefs, simple quotas does an increment
when creating an extent and a decrement when deleting it. Add the API
for the extent item code to use to track those events.

Also add a helper function to make collecting parent qgroups in a ulist
easier for functions like this.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/qgroup.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/qgroup.h | 10 +++++-
 2 files changed, 94 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 3c8b296215ee..8982b76ae9e5 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -332,6 +332,44 @@ static int del_relation_rb(struct btrfs_fs_info *fs_info,
 	return -ENOENT;
 }
 
+static int qgroup_collect_parents(struct btrfs_qgroup *qgroup,
+				  struct ulist *ul)
+{
+	struct ulist_iterator uiter;
+	struct ulist_node *unode;
+	struct btrfs_qgroup_list *glist;
+	struct btrfs_qgroup *qg;
+	bool err_free = false;
+	int ret = 0;
+
+	if (!ul) {
+		ul = ulist_alloc(GFP_KERNEL);
+		err_free = true;
+	} else {
+		ulist_reinit(ul);
+	}
+
+	ret = ulist_add(ul, qgroup->qgroupid,
+			qgroup_to_aux(qgroup), GFP_ATOMIC);
+	if (ret < 0)
+		goto out;
+	ULIST_ITER_INIT(&uiter);
+	while ((unode = ulist_next(ul, &uiter))) {
+		qg = unode_aux_to_qgroup(unode);
+		list_for_each_entry(glist, &qg->groups, next_group) {
+			ret = ulist_add(ul, glist->group->qgroupid,
+					qgroup_to_aux(glist->group), GFP_ATOMIC);
+			if (ret < 0)
+				goto out;
+		}
+	}
+	ret = 0;
+out:
+	if (ret && err_free)
+		ulist_free(ul);
+	return ret;
+}
+
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 int btrfs_verify_qgroup_counts(struct btrfs_fs_info *fs_info, u64 qgroupid,
 			       u64 rfer, u64 excl)
@@ -4472,3 +4510,50 @@ void btrfs_qgroup_destroy_extent_records(struct btrfs_transaction *trans)
 		kfree(entry);
 	}
 }
+
+int btrfs_record_simple_quota_delta(struct btrfs_fs_info *fs_info,
+				    struct btrfs_simple_quota_delta *delta)
+{
+	int ret;
+	struct ulist *ul = fs_info->qgroup_ulist;
+	struct btrfs_qgroup *qgroup;
+	struct ulist_iterator uiter;
+	struct ulist_node *unode;
+	struct btrfs_qgroup *qg;
+	bool drop_rsv = false;
+	u64 root = delta->root;
+	u64 num_bytes = delta->num_bytes;
+	int sign = delta->is_inc ? 1 : -1;
+
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_SIMPLE)
+		return 0;
+
+	if (!is_fstree(root))
+		return 0;
+
+	spin_lock(&fs_info->qgroup_lock);
+	qgroup = find_qgroup_rb(fs_info, root);
+	if (!qgroup) {
+		ret = -ENOENT;
+		goto out;
+	}
+	ret = qgroup_collect_parents(qgroup, ul);
+	if (ret)
+		goto out;
+
+	ULIST_ITER_INIT(&uiter);
+	while ((unode = ulist_next(ul, &uiter))) {
+		qg = unode_aux_to_qgroup(unode);
+		qg->excl += num_bytes * sign;
+		qg->rfer += num_bytes * sign;
+		if (delta->is_inc && delta->is_data)
+			drop_rsv = true;
+		qgroup_dirty(fs_info, qg);
+	}
+
+out:
+	spin_unlock(&fs_info->qgroup_lock);
+	if (!ret && drop_rsv)
+		btrfs_qgroup_free_refroot(fs_info, root, num_bytes, BTRFS_QGROUP_RSV_DATA);
+	return ret;
+}
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index d4c4d039585f..0d627a871900 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -235,6 +235,13 @@ struct btrfs_qgroup {
 	struct kobject kobj;
 };
 
+struct btrfs_simple_quota_delta {
+	u64 root; /* The fstree root this delta counts against */
+	u64 num_bytes; /* The number of bytes in the extent being counted */
+	bool is_inc; /* Whether we are using or freeing the extent */
+	bool is_data; /* Whether the extent is data or metadata */
+};
+
 static inline u64 btrfs_qgroup_subvolid(u64 qgroupid)
 {
 	return (qgroupid & ((1ULL << BTRFS_QGROUP_LEVEL_SHIFT) - 1));
@@ -447,5 +454,6 @@ int btrfs_qgroup_trace_subtree_after_cow(struct btrfs_trans_handle *trans,
 		struct btrfs_root *root, struct extent_buffer *eb);
 void btrfs_qgroup_destroy_extent_records(struct btrfs_transaction *trans);
 bool btrfs_check_quota_leak(struct btrfs_fs_info *fs_info);
-
+int btrfs_record_simple_quota_delta(struct btrfs_fs_info *fs_info,
+				    struct btrfs_simple_quota_delta *delta);
 #endif
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 3/9] btrfs: track original extent subvol in a new inline ref
  2023-05-03  0:59 [PATCH RFC 0/9] btrfs: simple quotas Boris Burkov
  2023-05-03  0:59 ` [PATCH 1/9] btrfs: simple quotas mode Boris Burkov
  2023-05-03  0:59 ` [PATCH 2/9] btrfs: new function for recording simple quota usage Boris Burkov
@ 2023-05-03  0:59 ` Boris Burkov
  2023-05-03  3:17   ` Qu Wenruo
  2023-05-03  0:59 ` [PATCH 4/9] btrfs: track metadata owning root in delayed refs Boris Burkov
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 23+ messages in thread
From: Boris Burkov @ 2023-05-03  0:59 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

In order to implement simple quota groups, we need to be able to
associate a data extent with the subvolume that created it. Once you
account for reflink, this information cannot be recovered without
explicitly storing it. Options for storing it are:
- a new key/item
- a new extent inline ref item

The former is backwards compatible, but wastes space, the latter is
incompat, but is efficient in space and reuses the existing inline ref
machinery, while only abusing it a tiny amount -- specifically, the new
item is not a ref, per-se.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/accessors.h            |  4 +++
 fs/btrfs/backref.c              |  3 ++
 fs/btrfs/extent-tree.c          | 50 +++++++++++++++++++++++++--------
 fs/btrfs/print-tree.c           | 12 ++++++++
 fs/btrfs/ref-verify.c           |  3 ++
 fs/btrfs/tree-checker.c         |  3 ++
 include/uapi/linux/btrfs_tree.h |  6 ++++
 7 files changed, 70 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
index ceadfc5d6c66..aab61312e4e8 100644
--- a/fs/btrfs/accessors.h
+++ b/fs/btrfs/accessors.h
@@ -350,6 +350,8 @@ BTRFS_SETGET_FUNCS(extent_data_ref_count, struct btrfs_extent_data_ref, count, 3
 
 BTRFS_SETGET_FUNCS(shared_data_ref_count, struct btrfs_shared_data_ref, count, 32);
 
+BTRFS_SETGET_FUNCS(extent_owner_ref_root_id, struct btrfs_extent_owner_ref, root_id, 64);
+
 BTRFS_SETGET_FUNCS(extent_inline_ref_type, struct btrfs_extent_inline_ref,
 		   type, 8);
 BTRFS_SETGET_FUNCS(extent_inline_ref_offset, struct btrfs_extent_inline_ref,
@@ -366,6 +368,8 @@ static inline u32 btrfs_extent_inline_ref_size(int type)
 	if (type == BTRFS_EXTENT_DATA_REF_KEY)
 		return sizeof(struct btrfs_extent_data_ref) +
 		       offsetof(struct btrfs_extent_inline_ref, offset);
+	if (type == BTRFS_EXTENT_OWNER_REF_KEY)
+		return sizeof(struct btrfs_extent_inline_ref);
 	return 0;
 }
 
diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index e54f0884802a..8cd8ed6c572f 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1128,6 +1128,9 @@ static int add_inline_refs(struct btrfs_backref_walk_ctx *ctx,
 						       count, sc, GFP_NOFS);
 			break;
 		}
+		case BTRFS_EXTENT_OWNER_REF_KEY:
+			WARN_ON(!btrfs_fs_incompat(ctx->fs_info, SIMPLE_QUOTA));
+			break;
 		default:
 			WARN_ON(1);
 		}
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5cd289de4e92..b9a2f1e355b7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -363,9 +363,13 @@ int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
 				     struct btrfs_extent_inline_ref *iref,
 				     enum btrfs_inline_ref_type is_data)
 {
+	struct btrfs_fs_info *fs_info = eb->fs_info;
 	int type = btrfs_extent_inline_ref_type(eb, iref);
 	u64 offset = btrfs_extent_inline_ref_offset(eb, iref);
 
+	if (type == BTRFS_EXTENT_OWNER_REF_KEY && btrfs_fs_incompat(fs_info, SIMPLE_QUOTA))
+		return type;
+
 	if (type == BTRFS_TREE_BLOCK_REF_KEY ||
 	    type == BTRFS_SHARED_BLOCK_REF_KEY ||
 	    type == BTRFS_SHARED_DATA_REF_KEY ||
@@ -374,26 +378,25 @@ int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
 			if (type == BTRFS_TREE_BLOCK_REF_KEY)
 				return type;
 			if (type == BTRFS_SHARED_BLOCK_REF_KEY) {
-				ASSERT(eb->fs_info);
+				ASSERT(fs_info);
 				/*
 				 * Every shared one has parent tree block,
 				 * which must be aligned to sector size.
 				 */
-				if (offset &&
-				    IS_ALIGNED(offset, eb->fs_info->sectorsize))
+				if (offset && IS_ALIGNED(offset, fs_info->sectorsize))
 					return type;
 			}
 		} else if (is_data == BTRFS_REF_TYPE_DATA) {
 			if (type == BTRFS_EXTENT_DATA_REF_KEY)
 				return type;
 			if (type == BTRFS_SHARED_DATA_REF_KEY) {
-				ASSERT(eb->fs_info);
+				ASSERT(fs_info);
 				/*
 				 * Every shared one has parent tree block,
 				 * which must be aligned to sector size.
 				 */
 				if (offset &&
-				    IS_ALIGNED(offset, eb->fs_info->sectorsize))
+				    IS_ALIGNED(offset, fs_info->sectorsize))
 					return type;
 			}
 		} else {
@@ -403,7 +406,7 @@ int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
 	}
 
 	btrfs_print_leaf((struct extent_buffer *)eb);
-	btrfs_err(eb->fs_info,
+	btrfs_err(fs_info,
 		  "eb %llu iref 0x%lx invalid extent inline ref type %d",
 		  eb->start, (unsigned long)iref, type);
 	WARN_ON(1);
@@ -912,6 +915,11 @@ int lookup_inline_extent_backref(struct btrfs_trans_handle *trans,
 		}
 		iref = (struct btrfs_extent_inline_ref *)ptr;
 		type = btrfs_get_extent_inline_ref_type(leaf, iref, needed);
+		if (type == BTRFS_EXTENT_OWNER_REF_KEY) {
+			WARN_ON(!btrfs_fs_incompat(fs_info, SIMPLE_QUOTA));
+			ptr += btrfs_extent_inline_ref_size(type);
+			continue;
+		}
 		if (type == BTRFS_REF_TYPE_INVALID) {
 			err = -EUCLEAN;
 			goto out;
@@ -1708,6 +1716,8 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 		 node->type == BTRFS_SHARED_DATA_REF_KEY)
 		ret = run_delayed_data_ref(trans, node, extent_op,
 					   insert_reserved);
+	else if (node->type == BTRFS_EXTENT_OWNER_REF_KEY)
+		ret = 0;
 	else
 		BUG();
 	if (ret && insert_reserved)
@@ -2275,6 +2285,7 @@ static noinline int check_committed_ref(struct btrfs_root *root,
 	struct btrfs_extent_item *ei;
 	struct btrfs_key key;
 	u32 item_size;
+	u32 expected_size;
 	int type;
 	int ret;
 
@@ -2301,10 +2312,17 @@ static noinline int check_committed_ref(struct btrfs_root *root,
 	ret = 1;
 	item_size = btrfs_item_size(leaf, path->slots[0]);
 	ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item);
+	expected_size = sizeof(*ei) + btrfs_extent_inline_ref_size(BTRFS_EXTENT_DATA_REF_KEY);
+
+	iref = (struct btrfs_extent_inline_ref *)(ei + 1);
+	type = btrfs_get_extent_inline_ref_type(leaf, iref, BTRFS_REF_TYPE_DATA);
+	if (btrfs_fs_incompat(fs_info, SIMPLE_QUOTA) && type == BTRFS_EXTENT_OWNER_REF_KEY) {
+		expected_size += btrfs_extent_inline_ref_size(BTRFS_EXTENT_OWNER_REF_KEY);
+		iref = (struct btrfs_extent_inline_ref *)(iref + 1);
+	}
 
 	/* If extent item has more than 1 inline ref then it's shared */
-	if (item_size != sizeof(*ei) +
-	    btrfs_extent_inline_ref_size(BTRFS_EXTENT_DATA_REF_KEY))
+	if (item_size != expected_size)
 		goto out;
 
 	/*
@@ -2316,8 +2334,6 @@ static noinline int check_committed_ref(struct btrfs_root *root,
 	     btrfs_root_last_snapshot(&root->root_item)))
 		goto out;
 
-	iref = (struct btrfs_extent_inline_ref *)(ei + 1);
-
 	/* If this extent has SHARED_DATA_REF then it's shared */
 	type = btrfs_get_extent_inline_ref_type(leaf, iref, BTRFS_REF_TYPE_DATA);
 	if (type != BTRFS_EXTENT_DATA_REF_KEY)
@@ -4572,6 +4588,7 @@ static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
 	struct btrfs_root *extent_root;
 	int ret;
 	struct btrfs_extent_item *extent_item;
+	struct btrfs_extent_owner_ref *oref;
 	struct btrfs_extent_inline_ref *iref;
 	struct btrfs_path *path;
 	struct extent_buffer *leaf;
@@ -4583,7 +4600,10 @@ static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
 	else
 		type = BTRFS_EXTENT_DATA_REF_KEY;
 
-	size = sizeof(*extent_item) + btrfs_extent_inline_ref_size(type);
+	size = sizeof(*extent_item);
+	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
+		size += btrfs_extent_inline_ref_size(BTRFS_EXTENT_OWNER_REF_KEY);
+	size += btrfs_extent_inline_ref_size(type);
 
 	path = btrfs_alloc_path();
 	if (!path)
@@ -4604,8 +4624,16 @@ static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
 	btrfs_set_extent_flags(leaf, extent_item,
 			       flags | BTRFS_EXTENT_FLAG_DATA);
 
+
 	iref = (struct btrfs_extent_inline_ref *)(extent_item + 1);
+	if (btrfs_fs_incompat(fs_info, SIMPLE_QUOTA)) {
+		btrfs_set_extent_inline_ref_type(leaf, iref, BTRFS_EXTENT_OWNER_REF_KEY);
+		oref = (struct btrfs_extent_owner_ref *)(&iref->offset);
+		btrfs_set_extent_owner_ref_root_id(leaf, oref, root_objectid);
+		iref = (struct btrfs_extent_inline_ref *)(oref + 1);
+	}
 	btrfs_set_extent_inline_ref_type(leaf, iref, type);
+
 	if (parent > 0) {
 		struct btrfs_shared_data_ref *ref;
 		ref = (struct btrfs_shared_data_ref *)(iref + 1);
diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c
index b93c96213304..1114cd915bd8 100644
--- a/fs/btrfs/print-tree.c
+++ b/fs/btrfs/print-tree.c
@@ -80,12 +80,20 @@ static void print_extent_data_ref(struct extent_buffer *eb,
 	       btrfs_extent_data_ref_count(eb, ref));
 }
 
+static void print_extent_owner_ref(struct extent_buffer *eb,
+				   struct btrfs_extent_owner_ref *ref)
+{
+	WARN_ON(!btrfs_fs_incompat(eb->fs_info, SIMPLE_QUOTA));
+	pr_cont("extent data owner root %llu\n", btrfs_extent_owner_ref_root_id(eb, ref));
+}
+
 static void print_extent_item(struct extent_buffer *eb, int slot, int type)
 {
 	struct btrfs_extent_item *ei;
 	struct btrfs_extent_inline_ref *iref;
 	struct btrfs_extent_data_ref *dref;
 	struct btrfs_shared_data_ref *sref;
+	struct btrfs_extent_owner_ref *oref;
 	struct btrfs_disk_key key;
 	unsigned long end;
 	unsigned long ptr;
@@ -159,6 +167,10 @@ static void print_extent_item(struct extent_buffer *eb, int slot, int type)
 			"\t\t\t(parent %llu not aligned to sectorsize %u)\n",
 				     offset, eb->fs_info->sectorsize);
 			break;
+		case BTRFS_EXTENT_OWNER_REF_KEY:
+			oref = (struct btrfs_extent_owner_ref *)(&iref->offset);
+			print_extent_owner_ref(eb, oref);
+			break;
 		default:
 			pr_cont("(extent %llu has INVALID ref type %d)\n",
 				  eb->start, type);
diff --git a/fs/btrfs/ref-verify.c b/fs/btrfs/ref-verify.c
index 95d28497de7c..9edc87eaff1f 100644
--- a/fs/btrfs/ref-verify.c
+++ b/fs/btrfs/ref-verify.c
@@ -485,6 +485,9 @@ static int process_extent_item(struct btrfs_fs_info *fs_info,
 			ret = add_shared_data_ref(fs_info, offset, count,
 						  key->objectid, key->offset);
 			break;
+		case BTRFS_EXTENT_OWNER_REF_KEY:
+			WARN_ON(!btrfs_fs_incompat(fs_info, SIMPLE_QUOTA));
+			break;
 		default:
 			btrfs_err(fs_info, "invalid key type in iref");
 			ret = -EINVAL;
diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index e2b54793bf0c..27d4230a38a8 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -1451,6 +1451,9 @@ static int check_extent_item(struct extent_buffer *leaf,
 			}
 			inline_refs += btrfs_shared_data_ref_count(leaf, sref);
 			break;
+		case BTRFS_EXTENT_OWNER_REF_KEY:
+			WARN_ON(!btrfs_fs_incompat(fs_info, SIMPLE_QUOTA));
+			break;
 		default:
 			extent_err(leaf, slot, "unknown inline ref type: %u",
 				   inline_type);
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index ab38d0f411fa..424c7f342712 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -226,6 +226,8 @@
 
 #define BTRFS_SHARED_DATA_REF_KEY	184
 
+#define BTRFS_EXTENT_OWNER_REF_KEY	190
+
 /*
  * block groups give us hints into the extent allocation trees.  Which
  * blocks are free etc etc
@@ -783,6 +785,10 @@ struct btrfs_shared_data_ref {
 	__le32 count;
 } __attribute__ ((__packed__));
 
+struct btrfs_extent_owner_ref {
+	u64 root_id;
+} __attribute__ ((__packed__));
+
 struct btrfs_extent_inline_ref {
 	__u8 type;
 	__le64 offset;
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 4/9] btrfs: track metadata owning root in delayed refs
  2023-05-03  0:59 [PATCH RFC 0/9] btrfs: simple quotas Boris Burkov
                   ` (2 preceding siblings ...)
  2023-05-03  0:59 ` [PATCH 3/9] btrfs: track original extent subvol in a new inline ref Boris Burkov
@ 2023-05-03  0:59 ` Boris Burkov
  2023-05-03  0:59 ` [PATCH 5/9] btrfs: record simple quota deltas Boris Burkov
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Boris Burkov @ 2023-05-03  0:59 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

While data extents require us to store additional inline refs to track
the original owner on free, this information is available implicitly for
metadata. It is found in the owner field of the header of the tree
block. Even if other trees refer to this block and the original ref goes
away, we will not rewrite that header field, so it will reliabl give the
original owner.

To use it for recording simple quota deltas, we need to wire this root
id through from when we create the delayed ref until we fully process it
and know that we are deleteing a metadata extent. Store it in the
btrfs_tree_ref struct of the delayed tree ref.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/delayed-ref.c |  7 ++++---
 fs/btrfs/delayed-ref.h | 16 +++++++++++++++-
 fs/btrfs/extent-tree.c |  8 ++++----
 fs/btrfs/relocation.c  | 11 ++++++-----
 4 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 0b32432d7d56..f4c01965e1ea 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -838,7 +838,7 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
 static void init_delayed_ref_common(struct btrfs_fs_info *fs_info,
 				    struct btrfs_delayed_ref_node *ref,
 				    u64 bytenr, u64 num_bytes, u64 ref_root,
-				    int action, u8 ref_type)
+				    int action, u8 ref_type, u64 allocation_owning_root)
 {
 	u64 seq = 0;
 
@@ -857,6 +857,7 @@ static void init_delayed_ref_common(struct btrfs_fs_info *fs_info,
 	ref->in_tree = 1;
 	ref->seq = seq;
 	ref->type = ref_type;
+	ref->allocation_owning_root = allocation_owning_root;
 	RB_CLEAR_NODE(&ref->ref_node);
 	INIT_LIST_HEAD(&ref->add_list);
 }
@@ -915,7 +916,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
 
 	init_delayed_ref_common(fs_info, &ref->node, bytenr, num_bytes,
 				generic_ref->tree_ref.owning_root, action,
-				ref_type);
+				ref_type, generic_ref->tree_ref.allocation_owning_root);
 	ref->root = generic_ref->tree_ref.owning_root;
 	ref->parent = parent;
 	ref->level = level;
@@ -989,7 +990,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
 	else
 	        ref_type = BTRFS_EXTENT_DATA_REF_KEY;
 	init_delayed_ref_common(fs_info, &ref->node, bytenr, num_bytes,
-				ref_root, action, ref_type);
+				ref_root, action, ref_type, ref_root);
 	ref->root = ref_root;
 	ref->parent = parent;
 	ref->objectid = owner;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index b54261fe509b..427de8a25eb2 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -32,6 +32,12 @@ struct btrfs_delayed_ref_node {
 	/* seq number to keep track of insertion order */
 	u64 seq;
 
+	/*
+	 * root which originally allocated this extent and owns it for
+	 * simple quota accounting purposes.
+	 */
+	u64 allocation_owning_root;
+
 	/* ref count on this data structure */
 	refcount_t refs;
 
@@ -216,6 +222,12 @@ struct btrfs_tree_ref {
 	u64 owning_root;
 
 	/* For non-skinny metadata, no special member needed */
+
+	/*
+	 * Root that originally allocated this block and owns the allocation for
+	 * simple quota accounting purposes.
+	 */
+	u64 allocation_owning_root;
 };
 
 struct btrfs_ref {
@@ -284,7 +296,8 @@ static inline void btrfs_init_generic_ref(struct btrfs_ref *generic_ref,
 }
 
 static inline void btrfs_init_tree_ref(struct btrfs_ref *generic_ref,
-				int level, u64 root, u64 mod_root, bool skip_qgroup)
+				int level, u64 root, u64 mod_root,
+				bool skip_qgroup, u64 allocation_owning_root)
 {
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	/* If @real_root not set, use @root as fallback */
@@ -292,6 +305,7 @@ static inline void btrfs_init_tree_ref(struct btrfs_ref *generic_ref,
 #endif
 	generic_ref->tree_ref.level = level;
 	generic_ref->tree_ref.owning_root = root;
+	generic_ref->tree_ref.allocation_owning_root = allocation_owning_root;
 	generic_ref->type = BTRFS_REF_METADATA;
 	if (skip_qgroup || !(is_fstree(root) &&
 			     (!mod_root || is_fstree(mod_root))))
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b9a2f1e355b7..c821d22be2ca 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2446,7 +2446,7 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
 			btrfs_init_generic_ref(&generic_ref, action, bytenr,
 					       num_bytes, parent);
 			btrfs_init_tree_ref(&generic_ref, level - 1, ref_root,
-					    root->root_key.objectid, for_reloc);
+					    root->root_key.objectid, for_reloc, ref_root);
 			if (inc)
 				ret = btrfs_inc_extent_ref(trans, &generic_ref);
 			else
@@ -3268,7 +3268,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 	btrfs_init_generic_ref(&generic_ref, BTRFS_DROP_DELAYED_REF,
 			       buf->start, buf->len, parent);
 	btrfs_init_tree_ref(&generic_ref, btrfs_header_level(buf),
-			    root_id, 0, false);
+			    root_id, 0, false, btrfs_header_owner(buf));
 
 	if (root_id != BTRFS_TREE_LOG_OBJECTID) {
 		btrfs_ref_tree_mod(fs_info, &generic_ref);
@@ -4952,7 +4952,7 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
 		btrfs_init_generic_ref(&generic_ref, BTRFS_ADD_DELAYED_EXTENT,
 				       ins.objectid, ins.offset, parent);
 		btrfs_init_tree_ref(&generic_ref, level, root_objectid,
-				    root->root_key.objectid, false);
+				    root->root_key.objectid, false, btrfs_header_owner(buf));
 		btrfs_ref_tree_mod(fs_info, &generic_ref);
 		ret = btrfs_add_delayed_tree_ref(trans, &generic_ref, extent_op);
 		if (ret)
@@ -5374,7 +5374,7 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
 		btrfs_init_generic_ref(&ref, BTRFS_DROP_DELAYED_REF, bytenr,
 				       fs_info->nodesize, parent);
 		btrfs_init_tree_ref(&ref, level - 1, root->root_key.objectid,
-				    0, false);
+				    0, false, btrfs_header_owner(next));
 		ret = btrfs_free_extent(trans, &ref);
 		if (ret)
 			goto out_unlock;
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 09b1988d1791..0e981e8a374e 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1384,7 +1384,7 @@ int replace_path(struct btrfs_trans_handle *trans, struct reloc_control *rc,
 		btrfs_init_generic_ref(&ref, BTRFS_ADD_DELAYED_REF, old_bytenr,
 				       blocksize, path->nodes[level]->start);
 		btrfs_init_tree_ref(&ref, level - 1, src->root_key.objectid,
-				    0, true);
+				    0, true, src->root_key.objectid);
 		ret = btrfs_inc_extent_ref(trans, &ref);
 		if (ret) {
 			btrfs_abort_transaction(trans, ret);
@@ -1393,7 +1393,7 @@ int replace_path(struct btrfs_trans_handle *trans, struct reloc_control *rc,
 		btrfs_init_generic_ref(&ref, BTRFS_ADD_DELAYED_REF, new_bytenr,
 				       blocksize, 0);
 		btrfs_init_tree_ref(&ref, level - 1, dest->root_key.objectid, 0,
-				    true);
+				    true, dest->root_key.objectid);
 		ret = btrfs_inc_extent_ref(trans, &ref);
 		if (ret) {
 			btrfs_abort_transaction(trans, ret);
@@ -1403,7 +1403,7 @@ int replace_path(struct btrfs_trans_handle *trans, struct reloc_control *rc,
 		btrfs_init_generic_ref(&ref, BTRFS_DROP_DELAYED_REF, new_bytenr,
 				       blocksize, path->nodes[level]->start);
 		btrfs_init_tree_ref(&ref, level - 1, src->root_key.objectid,
-				    0, true);
+				    0, true, src->root_key.objectid);
 		ret = btrfs_free_extent(trans, &ref);
 		if (ret) {
 			btrfs_abort_transaction(trans, ret);
@@ -1413,7 +1413,7 @@ int replace_path(struct btrfs_trans_handle *trans, struct reloc_control *rc,
 		btrfs_init_generic_ref(&ref, BTRFS_DROP_DELAYED_REF, old_bytenr,
 				       blocksize, 0);
 		btrfs_init_tree_ref(&ref, level - 1, dest->root_key.objectid,
-				    0, true);
+				    0, true, src->root_key.objectid);
 		ret = btrfs_free_extent(trans, &ref);
 		if (ret) {
 			btrfs_abort_transaction(trans, ret);
@@ -2494,7 +2494,8 @@ static int do_relocation(struct btrfs_trans_handle *trans,
 					       upper->eb->start);
 			btrfs_init_tree_ref(&ref, node->level,
 					    btrfs_header_owner(upper->eb),
-					    root->root_key.objectid, false);
+					    root->root_key.objectid, false,
+					    btrfs_header_owner(upper->eb));
 			ret = btrfs_inc_extent_ref(trans, &ref);
 			if (!ret)
 				ret = btrfs_drop_subtree(trans, root, eb,
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 5/9] btrfs: record simple quota deltas
  2023-05-03  0:59 [PATCH RFC 0/9] btrfs: simple quotas Boris Burkov
                   ` (3 preceding siblings ...)
  2023-05-03  0:59 ` [PATCH 4/9] btrfs: track metadata owning root in delayed refs Boris Burkov
@ 2023-05-03  0:59 ` Boris Burkov
  2023-05-03  0:59 ` [PATCH 6/9] btrfs: auto hierarchy for simple qgroups of nested subvols Boris Burkov
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Boris Burkov @ 2023-05-03  0:59 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

At the moment that we run delayed refs, we make the final ref-count
based decision on creating/removing extent (and metadata) items.
Therefore, it is exactly the spot to hook up simple quotas.

There are a few important subtleties to the fields we must collect to
accurately track simple quotas, particularly when removing an extent.
When removing a data extent, the ref could be in any tree (due to
reflink, for example) and so we need to recover the owning root id from
the owner ref item. When removing a metadata extent, we know the owning
root from the owner field in the header when we create the delayed ref,
so we can recover it from there.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/delayed-ref.c |  6 ++++
 fs/btrfs/delayed-ref.h | 12 +++++++
 fs/btrfs/extent-tree.c | 82 ++++++++++++++++++++++++++++++++++++++----
 fs/btrfs/qgroup.c      | 11 +++---
 fs/btrfs/qgroup.h      |  2 ++
 fs/btrfs/transaction.c |  5 +++
 6 files changed, 106 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index f4c01965e1ea..c06e5c8bcdc1 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -732,6 +732,12 @@ static void init_delayed_ref_head(struct btrfs_delayed_ref_head *head_ref,
 	head_ref->bytenr = bytenr;
 	head_ref->num_bytes = num_bytes;
 	head_ref->ref_mod = count_mod;
+	head_ref->ref_root = 0;
+	head_ref->reserved_bytes = 0;
+	if (ref_root && reserved) {
+		head_ref->ref_root = ref_root;
+		head_ref->reserved_bytes = reserved;
+	}
 	head_ref->must_insert_reserved = must_insert_reserved;
 	head_ref->is_data = is_data;
 	head_ref->is_system = is_system;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 427de8a25eb2..64aba4601de4 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -107,6 +107,18 @@ struct btrfs_delayed_ref_head {
 	 */
 	int ref_mod;
 
+	/*
+	 * Track the root which accounted for the reservation relevant to
+	 * must_insert_reserved. In simple quota mode, we need to drop
+	 * reservations for such head refs when we drop delayed refs.
+	 */
+	u64 ref_root;
+	/*
+	 * Track reserved bytes when setting must_insert_reserved.
+	 * On success or cleanup, we will need to free the reservation.
+	 */
+	u64 reserved_bytes;
+
 	/*
 	 * when a new extent is allocated, it is just reserved in memory
 	 * The actual extent isn't inserted into the extent allocation tree
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c821d22be2ca..7379ee04018d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1506,6 +1506,7 @@ static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 }
 
 static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
+				struct btrfs_delayed_ref_head *href,
 				struct btrfs_delayed_ref_node *node,
 				struct btrfs_delayed_extent_op *extent_op,
 				int insert_reserved)
@@ -1529,12 +1530,23 @@ static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
 	ref_root = ref->root;
 
 	if (node->action == BTRFS_ADD_DELAYED_REF && insert_reserved) {
+		struct btrfs_simple_quota_delta delta = {
+			.root = ref_root,
+			.num_bytes = node->num_bytes,
+			.rsv_root = href->ref_root,
+			.rsv_bytes = href->reserved_bytes,
+			.is_data = true,
+			.is_inc	= true,
+		};
+
 		if (extent_op)
 			flags |= extent_op->flags_to_set;
 		ret = alloc_reserved_file_extent(trans, parent, ref_root,
 						 flags, ref->objectid,
 						 ref->offset, &ins,
 						 node->ref_mod);
+		if (!ret)
+			ret = btrfs_record_simple_quota_delta(trans->fs_info, &delta);
 	} else if (node->action == BTRFS_ADD_DELAYED_REF) {
 		ret = __btrfs_inc_extent_ref(trans, node, parent, ref_root,
 					     ref->objectid, ref->offset,
@@ -1661,6 +1673,7 @@ static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
 				int insert_reserved)
 {
 	int ret = 0;
+	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct btrfs_delayed_tree_ref *ref;
 	u64 parent = 0;
 	u64 ref_root = 0;
@@ -1680,8 +1693,19 @@ static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
 		return -EIO;
 	}
 	if (node->action == BTRFS_ADD_DELAYED_REF && insert_reserved) {
+		struct btrfs_simple_quota_delta delta = {
+			.root = ref_root,
+			.num_bytes = fs_info->nodesize,
+			.rsv_root = 0,
+			.rsv_bytes = 0,
+			.is_data = false,
+			.is_inc = true,
+		};
+
 		BUG_ON(!extent_op || !extent_op->update_flags);
 		ret = alloc_reserved_tree_block(trans, node, extent_op);
+		if (!ret)
+			btrfs_record_simple_quota_delta(fs_info, &delta);
 	} else if (node->action == BTRFS_ADD_DELAYED_REF) {
 		ret = __btrfs_inc_extent_ref(trans, node, parent, ref_root,
 					     ref->level, 0, 1, extent_op);
@@ -1696,6 +1720,7 @@ static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
 
 /* helper function to actually process a single delayed ref entry */
 static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
+			       struct btrfs_delayed_ref_head *href,
 			       struct btrfs_delayed_ref_node *node,
 			       struct btrfs_delayed_extent_op *extent_op,
 			       int insert_reserved)
@@ -1714,8 +1739,8 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 					   insert_reserved);
 	else if (node->type == BTRFS_EXTENT_DATA_REF_KEY ||
 		 node->type == BTRFS_SHARED_DATA_REF_KEY)
-		ret = run_delayed_data_ref(trans, node, extent_op,
-					   insert_reserved);
+		ret = run_delayed_data_ref(trans, href, node,
+					   extent_op, insert_reserved);
 	else if (node->type == BTRFS_EXTENT_OWNER_REF_KEY)
 		ret = 0;
 	else
@@ -1812,6 +1837,11 @@ void btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info,
 		spin_unlock(&delayed_refs->lock);
 		nr_items += btrfs_csum_bytes_to_leaves(fs_info, head->num_bytes);
 	}
+	if (head->must_insert_reserved && head->is_data &&
+	    btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
+		btrfs_qgroup_free_refroot(fs_info, head->ref_root,
+					  head->reserved_bytes,
+					  BTRFS_QGROUP_RSV_DATA);
 
 	btrfs_delayed_refs_rsv_release(fs_info, nr_items);
 }
@@ -1959,8 +1989,8 @@ static int btrfs_run_delayed_refs_for_head(struct btrfs_trans_handle *trans,
 		locked_ref->extent_op = NULL;
 		spin_unlock(&locked_ref->lock);
 
-		ret = run_one_delayed_ref(trans, ref, extent_op,
-					  must_insert_reserved);
+		ret = run_one_delayed_ref(trans, locked_ref, ref,
+					  extent_op, must_insert_reserved);
 
 		btrfs_free_delayed_extent_op(extent_op);
 		if (ret) {
@@ -2826,11 +2856,12 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
 }
 
 static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
-				     u64 bytenr, u64 num_bytes, bool is_data)
+				     u64 bytenr, struct btrfs_simple_quota_delta *delta)
 {
 	int ret;
+	u64 num_bytes = delta->num_bytes;
 
-	if (is_data) {
+	if (delta->is_data) {
 		struct btrfs_root *csum_root;
 
 		csum_root = btrfs_csum_root(trans->fs_info, bytenr);
@@ -2841,6 +2872,12 @@ static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
 		}
 	}
 
+	ret = btrfs_record_simple_quota_delta(trans->fs_info, delta);
+	if (ret) {
+		btrfs_abort_transaction(trans, ret);
+		return ret;
+	}
+
 	ret = add_to_free_space_tree(trans, bytenr, num_bytes);
 	if (ret) {
 		btrfs_abort_transaction(trans, ret);
@@ -2936,6 +2973,7 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 	u64 bytenr = node->bytenr;
 	u64 num_bytes = node->num_bytes;
 	bool skinny_metadata = btrfs_fs_incompat(info, SKINNY_METADATA);
+	u64 delayed_ref_root = node->allocation_owning_root;
 
 	extent_root = btrfs_extent_root(info, bytenr);
 	ASSERT(extent_root);
@@ -3133,6 +3171,16 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 			}
 		}
 	} else {
+		struct btrfs_extent_owner_ref *oref;
+		struct btrfs_simple_quota_delta delta = {
+			.root = delayed_ref_root,
+			.num_bytes = num_bytes,
+			.rsv_root = 0,
+			.rsv_bytes = 0,
+			.is_data = is_data,
+			.is_inc = false,
+		};
+
 		/* In this branch refs == 1 */
 		if (found_extent) {
 			if (is_data && refs_to_drop !=
@@ -3170,6 +3218,26 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 				num_to_del = 2;
 			}
 		}
+		/*
+		 * We can't infer the data owner from the delayed ref, so we
+		 * need to try to get it from the owning ref item.
+		 *
+		 * If it is not present, then that extent was not written under
+		 * simple quotas mode, so we don't need to account for its
+		 * deletion.
+		 */
+		if (is_data && btrfs_fs_incompat(trans->fs_info, SIMPLE_QUOTA)) {
+			struct btrfs_extent_inline_ref *owning_iref;
+			int type;
+
+			owning_iref = (struct btrfs_extent_inline_ref *)(ei + 1);
+			type = btrfs_get_extent_inline_ref_type(leaf, owning_iref,
+								BTRFS_REF_TYPE_ANY);
+			if (type == BTRFS_EXTENT_OWNER_REF_KEY) {
+				oref = (struct btrfs_extent_owner_ref *)(&owning_iref->offset);
+				delta.root = btrfs_extent_owner_ref_root_id(leaf, oref);
+			}
+		}
 
 		ret = btrfs_del_items(trans, extent_root, path, path->slots[0],
 				      num_to_del);
@@ -3179,7 +3247,7 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 		}
 		btrfs_release_path(path);
 
-		ret = do_free_extent_accounting(trans, bytenr, num_bytes, is_data);
+		ret = do_free_extent_accounting(trans, bytenr, &delta);
 	}
 	btrfs_release_path(path);
 
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 8982b76ae9e5..e3d0630fef0c 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1655,6 +1655,9 @@ int btrfs_create_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid)
 	struct btrfs_qgroup *qgroup;
 	int ret = 0;
 
+	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_DISABLED)
+		return 0;
+
 	mutex_lock(&fs_info->qgroup_ioctl_lock);
 	if (!fs_info->quota_root) {
 		ret = -ENOTCONN;
@@ -4520,7 +4523,6 @@ int btrfs_record_simple_quota_delta(struct btrfs_fs_info *fs_info,
 	struct ulist_iterator uiter;
 	struct ulist_node *unode;
 	struct btrfs_qgroup *qg;
-	bool drop_rsv = false;
 	u64 root = delta->root;
 	u64 num_bytes = delta->num_bytes;
 	int sign = delta->is_inc ? 1 : -1;
@@ -4546,14 +4548,13 @@ int btrfs_record_simple_quota_delta(struct btrfs_fs_info *fs_info,
 		qg = unode_aux_to_qgroup(unode);
 		qg->excl += num_bytes * sign;
 		qg->rfer += num_bytes * sign;
-		if (delta->is_inc && delta->is_data)
-			drop_rsv = true;
 		qgroup_dirty(fs_info, qg);
 	}
 
 out:
 	spin_unlock(&fs_info->qgroup_lock);
-	if (!ret && drop_rsv)
-		btrfs_qgroup_free_refroot(fs_info, root, num_bytes, BTRFS_QGROUP_RSV_DATA);
+	if (!ret && delta->rsv_bytes && delta->rsv_root)
+		btrfs_qgroup_free_refroot(fs_info, delta->rsv_root,
+					  delta->rsv_bytes, BTRFS_QGROUP_RSV_DATA);
 	return ret;
 }
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 0d627a871900..b300998dcbc7 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -238,6 +238,8 @@ struct btrfs_qgroup {
 struct btrfs_simple_quota_delta {
 	u64 root; /* The fstree root this delta counts against */
 	u64 num_bytes; /* The number of bytes in the extent being counted */
+	u64 rsv_root; /* The fstree root this delta should free rsv for */
+	u64 rsv_bytes; /* The number of bytes reserved for this extent */
 	bool is_inc; /* Whether we are using or freeing the extent */
 	bool is_data; /* Whether the extent is data or metadata */
 };
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index e6d6752c2fca..0edfb58afd80 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1704,6 +1704,11 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 	}
 	btrfs_release_path(path);
 
+	ret = btrfs_create_qgroup(trans, objectid);
+	if (ret) {
+		btrfs_abort_transaction(trans, ret);
+		goto fail;
+	}
 	/*
 	 * pull in the delayed directory update
 	 * and the delayed inode item
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 6/9] btrfs: auto hierarchy for simple qgroups of nested subvols
  2023-05-03  0:59 [PATCH RFC 0/9] btrfs: simple quotas Boris Burkov
                   ` (4 preceding siblings ...)
  2023-05-03  0:59 ` [PATCH 5/9] btrfs: record simple quota deltas Boris Burkov
@ 2023-05-03  0:59 ` Boris Burkov
  2023-05-03  4:47   ` Qu Wenruo
  2023-05-03  0:59 ` [PATCH 7/9] btrfs: check generation when recording simple quota delta Boris Burkov
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 23+ messages in thread
From: Boris Burkov @ 2023-05-03  0:59 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Consider the following sequence:
- enable quotas
- create subvol S id 256 at dir outer/
- create a qgroup 1/100
- add 0/256 (S's auto qgroup) to 1/100
- create subvol T id 257 at dir outer/inner/

With full qgroups, there is no relationship between 0/257 and either of
0/256 or 1/100. There is an inherit feature that the creator of inner/
can use to specify it ought to be in 1/100.

Simple quotas are targeted at container isolation, where such automatic
inheritance for not necessarily trusted/controlled nested subvol
creation would be quite helpful. Therefore, add a new default behavior
for simple quotas: when you create a nested subvol, automatically
inherit as parents any parents of the qgroup of the subvol the new inode
is going in.

In our example, 257/0 would also be under 1/100, allowing easy control
of a total quota over an arbitrary hierarchy of subvolumes.

I think this _might_ be a generally useful behavior, so it could be
interesting to put it behind a new inheritance flag that simple quotas
always use while traditional quotas let the user specify, but this is a
minimally intrusive change to start.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/ioctl.c       |  2 +-
 fs/btrfs/qgroup.c      | 46 +++++++++++++++++++++++++++++++++++++++---
 fs/btrfs/qgroup.h      |  6 +++---
 fs/btrfs/transaction.c |  2 +-
 4 files changed, 48 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index ca7d2ef739c8..4d6d28feb5c6 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -652,7 +652,7 @@ static noinline int create_subvol(struct mnt_idmap *idmap,
 	/* Tree log can't currently deal with an inode which is a new root. */
 	btrfs_set_log_full_commit(trans);
 
-	ret = btrfs_qgroup_inherit(trans, 0, objectid, inherit);
+	ret = btrfs_qgroup_inherit(trans, 0, objectid, root->root_key.objectid, inherit);
 	if (ret)
 		goto out;
 
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index e3d0630fef0c..6816e01f00b5 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1504,8 +1504,7 @@ static int quick_update_accounting(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
-int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src,
-			      u64 dst)
+int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src, u64 dst)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct btrfs_qgroup *parent;
@@ -2945,6 +2944,42 @@ int btrfs_run_qgroups(struct btrfs_trans_handle *trans)
 	return ret;
 }
 
+static int qgroup_auto_inherit(struct btrfs_fs_info *fs_info,
+			       u64 inode_rootid,
+			       struct btrfs_qgroup_inherit **inherit)
+{
+	int i = 0;
+	u64 num_qgroups = 0;
+	struct btrfs_qgroup *inode_qg;
+	struct btrfs_qgroup_list *qg_list;
+
+	if (*inherit)
+		return -EEXIST;
+
+	inode_qg = find_qgroup_rb(fs_info, inode_rootid);
+	if (!inode_qg)
+		return -ENOENT;
+
+	list_for_each_entry(qg_list, &inode_qg->groups, next_group) {
+		++num_qgroups;
+	}
+
+	if (!num_qgroups)
+		return 0;
+
+	*inherit = kzalloc(sizeof(**inherit) + num_qgroups * sizeof(u64), GFP_NOFS);
+	if (!*inherit)
+		return -ENOMEM;
+	(*inherit)->num_qgroups = num_qgroups;
+
+	list_for_each_entry(qg_list, &inode_qg->groups, next_group) {
+		u64 qg_id = qg_list->group->qgroupid;
+		*((u64 *)((*inherit)+1) + i) = qg_id;
+	}
+
+	return 0;
+}
+
 /*
  * Copy the accounting information between qgroups. This is necessary
  * when a snapshot or a subvolume is created. Throwing an error will
@@ -2952,7 +2987,8 @@ int btrfs_run_qgroups(struct btrfs_trans_handle *trans)
  * when a readonly fs is a reasonable outcome.
  */
 int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
-			 u64 objectid, struct btrfs_qgroup_inherit *inherit)
+			 u64 objectid, u64 inode_rootid,
+			 struct btrfs_qgroup_inherit *inherit)
 {
 	int ret = 0;
 	int i;
@@ -2994,6 +3030,9 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
 		goto out;
 	}
 
+	if (!inherit && btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
+		qgroup_auto_inherit(fs_info, inode_rootid, &inherit);
+
 	if (inherit) {
 		i_qgroups = (u64 *)(inherit + 1);
 		nums = inherit->num_qgroups + 2 * inherit->num_ref_copies +
@@ -3020,6 +3059,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
 	if (ret)
 		goto out;
 
+
 	/*
 	 * add qgroup to all inherited groups
 	 */
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index b300998dcbc7..aecebe9d0d62 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -272,8 +272,7 @@ int btrfs_qgroup_rescan(struct btrfs_fs_info *fs_info);
 void btrfs_qgroup_rescan_resume(struct btrfs_fs_info *fs_info);
 int btrfs_qgroup_wait_for_completion(struct btrfs_fs_info *fs_info,
 				     bool interruptible);
-int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src,
-			      u64 dst);
+int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src, u64 dst);
 int btrfs_del_qgroup_relation(struct btrfs_trans_handle *trans, u64 src,
 			      u64 dst);
 int btrfs_create_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid);
@@ -367,7 +366,8 @@ int btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans, u64 bytenr,
 int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans);
 int btrfs_run_qgroups(struct btrfs_trans_handle *trans);
 int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
-			 u64 objectid, struct btrfs_qgroup_inherit *inherit);
+			 u64 objectid, u64 inode_rootid,
+			 struct btrfs_qgroup_inherit *inherit);
 void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info,
 			       u64 ref_root, u64 num_bytes,
 			       enum btrfs_qgroup_rsv_type type);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 0edfb58afd80..6befcf1b4b1f 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1557,7 +1557,7 @@ static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
 
 	/* Now qgroup are all updated, we can inherit it to new qgroups */
 	ret = btrfs_qgroup_inherit(trans, src->root_key.objectid, dst_objectid,
-				   inherit);
+				   parent->root_key.objectid, inherit);
 	if (ret < 0)
 		goto out;
 
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 7/9] btrfs: check generation when recording simple quota delta
  2023-05-03  0:59 [PATCH RFC 0/9] btrfs: simple quotas Boris Burkov
                   ` (5 preceding siblings ...)
  2023-05-03  0:59 ` [PATCH 6/9] btrfs: auto hierarchy for simple qgroups of nested subvols Boris Burkov
@ 2023-05-03  0:59 ` Boris Burkov
  2023-05-03  0:59 ` [PATCH 8/9] btrfs: expose the qgroup mode via sysfs Boris Burkov
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Boris Burkov @ 2023-05-03  0:59 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Simple quotas count extents only from the moment the feature is enabled.
Therefore, if we do something like:
1. create subvol S
2. write F in S
3. enable quotas
4. remove F
5. write G in S

then after 3. and 4. we would expect the simple quota usage of S to be 0
(putting aside some metadata extents that might be written) and after
5., it should be the size of G plus metadata. Therefore, we need to be
able to determine whether a particular quota delta we are processing
predates simple quota enablement.

To do this, store the transaction id when quotas were enabled. In
fs_info for immediate use and in the quota status item to make it
recoverable on mount. When we see a delta, check if the generation of
the extent item is less than that of quota enablement. If so, we should
ignore the delta from this extent.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/accessors.h            |  2 ++
 fs/btrfs/extent-tree.c          |  3 +++
 fs/btrfs/fs.h                   |  2 ++
 fs/btrfs/qgroup.c               | 20 ++++++++++++++++----
 fs/btrfs/qgroup.h               |  1 +
 include/uapi/linux/btrfs_tree.h |  7 +++++++
 6 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
index aab61312e4e8..8d122cc42cb7 100644
--- a/fs/btrfs/accessors.h
+++ b/fs/btrfs/accessors.h
@@ -971,6 +971,8 @@ BTRFS_SETGET_FUNCS(qgroup_status_flags, struct btrfs_qgroup_status_item,
 		   flags, 64);
 BTRFS_SETGET_FUNCS(qgroup_status_rescan, struct btrfs_qgroup_status_item,
 		   rescan, 64);
+BTRFS_SETGET_FUNCS(qgroup_status_enable_gen, struct btrfs_qgroup_status_item,
+		   enable_gen, 64);
 
 /* btrfs_qgroup_info_item */
 BTRFS_SETGET_FUNCS(qgroup_info_generation, struct btrfs_qgroup_info_item,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 7379ee04018d..7f48e7d34b09 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1537,6 +1537,7 @@ static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
 			.rsv_bytes = href->reserved_bytes,
 			.is_data = true,
 			.is_inc	= true,
+			.generation = trans->transid,
 		};
 
 		if (extent_op)
@@ -1700,6 +1701,7 @@ static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
 			.rsv_bytes = 0,
 			.is_data = false,
 			.is_inc = true,
+			.generation = trans->transid,
 		};
 
 		BUG_ON(!extent_op || !extent_op->update_flags);
@@ -3179,6 +3181,7 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 			.rsv_bytes = 0,
 			.is_data = is_data,
 			.is_inc = false,
+			.generation = btrfs_extent_generation(leaf, ei),
 		};
 
 		/* In this branch refs == 1 */
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 6c989d87768c..d67ee2652c87 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -803,6 +803,8 @@ struct btrfs_fs_info {
 	spinlock_t eb_leak_lock;
 	struct list_head allocated_ebs;
 #endif
+
+	u64 quota_enable_gen;
 };
 
 static inline void btrfs_set_last_root_drop_gen(struct btrfs_fs_info *fs_info,
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 6816e01f00b5..cd28e92d8a37 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -467,8 +467,9 @@ int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info)
 				btrfs_err(fs_info,
 					"qgroup generation mismatch, marked as inconsistent");
 			}
-			fs_info->qgroup_flags = btrfs_qgroup_status_flags(l,
-									  ptr);
+			if (btrfs_fs_incompat(fs_info, SIMPLE_QUOTA))
+				fs_info->quota_enable_gen = btrfs_qgroup_status_enable_gen(l, ptr);
+			fs_info->qgroup_flags = btrfs_qgroup_status_flags(l, ptr);
 			rescan_progress = btrfs_qgroup_status_rescan(l, ptr);
 			goto next1;
 		}
@@ -1114,6 +1115,10 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info,
 	ptr = btrfs_item_ptr(leaf, path->slots[0],
 				 struct btrfs_qgroup_status_item);
 	btrfs_set_qgroup_status_generation(leaf, ptr, trans->transid);
+	if (simple_qgroups) {
+		btrfs_set_fs_incompat(fs_info, SIMPLE_QUOTA);
+		btrfs_set_qgroup_status_enable_gen(leaf, ptr, trans->transid);
+	}
 	btrfs_set_qgroup_status_version(leaf, ptr, BTRFS_QGROUP_STATUS_VERSION);
 	fs_info->qgroup_flags = BTRFS_QGROUP_STATUS_FLAG_ON;
 	if (!simple_qgroups)
@@ -1209,6 +1214,8 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info,
 		goto out_free_path;
 	}
 
+	fs_info->quota_enable_gen = trans->transid;
+
 	mutex_unlock(&fs_info->qgroup_ioctl_lock);
 	/*
 	 * Commit the transaction while not holding qgroup_ioctl_lock, to avoid
@@ -1233,8 +1240,6 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info,
 	spin_lock(&fs_info->qgroup_lock);
 	fs_info->quota_root = quota_root;
 	set_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);
-	if (simple_qgroups)
-		btrfs_set_fs_incompat(fs_info, SIMPLE_QUOTA);
 	spin_unlock(&fs_info->qgroup_lock);
 
 	/* Skip rescan for simple qgroups */
@@ -4573,6 +4578,13 @@ int btrfs_record_simple_quota_delta(struct btrfs_fs_info *fs_info,
 	if (!is_fstree(root))
 		return 0;
 
+	/*
+	 * If the extent predates enabling quotas, don't count it.
+	 * This is particularly likely when freeing old extents.
+	 */
+	if (delta->generation < fs_info->quota_enable_gen)
+		return 0;
+
 	spin_lock(&fs_info->qgroup_lock);
 	qgroup = find_qgroup_rb(fs_info, root);
 	if (!qgroup) {
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index aecebe9d0d62..9f3c397d51fb 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -242,6 +242,7 @@ struct btrfs_simple_quota_delta {
 	u64 rsv_bytes; /* The number of bytes reserved for this extent */
 	bool is_inc; /* Whether we are using or freeing the extent */
 	bool is_data; /* Whether the extent is data or metadata */
+	u64 generation; /* The generation the extent was created in */
 };
 
 static inline u64 btrfs_qgroup_subvolid(u64 qgroupid)
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 424c7f342712..7797560f0215 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -1230,6 +1230,13 @@ struct btrfs_qgroup_status_item {
 	 * of the scan. It contains a logical address
 	 */
 	__le64 rescan;
+
+	/*
+	 * the generation when quotas are enabled. Used by simple quotas to
+	 * avoid decrementing when freeing an extent that was written before
+	 * enable.
+	 */
+	__le64 enable_gen;
 } __attribute__ ((__packed__));
 
 struct btrfs_qgroup_info_item {
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 8/9] btrfs: expose the qgroup mode via sysfs
  2023-05-03  0:59 [PATCH RFC 0/9] btrfs: simple quotas Boris Burkov
                   ` (6 preceding siblings ...)
  2023-05-03  0:59 ` [PATCH 7/9] btrfs: check generation when recording simple quota delta Boris Burkov
@ 2023-05-03  0:59 ` Boris Burkov
  2023-05-03  0:59 ` [PATCH 9/9] btrfs: free qgroup rsv on io failure Boris Burkov
  2023-05-05  4:13 ` [PATCH RFC 0/9] btrfs: simple quotas Anand Jain
  9 siblings, 0 replies; 23+ messages in thread
From: Boris Burkov @ 2023-05-03  0:59 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Add a new sysfs file
/sys/fs/btrfs/<uuid>/qgroups/mode
which prints out the mode qgroups is running in. The possible modes are
disabled, qgroup, and squota

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/sysfs.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 25294e624851..d2de2207120a 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -2079,6 +2079,31 @@ static ssize_t qgroup_enabled_show(struct kobject *qgroups_kobj,
 }
 BTRFS_ATTR(qgroups, enabled, qgroup_enabled_show);
 
+static ssize_t qgroup_mode_show(struct kobject *qgroups_kobj,
+				struct kobj_attribute *a,
+				char *buf)
+{
+	struct btrfs_fs_info *fs_info = to_fs_info(qgroups_kobj->parent);
+	char *mode = "";
+
+	spin_lock(&fs_info->qgroup_lock);
+	switch (btrfs_qgroup_mode(fs_info)) {
+	case BTRFS_QGROUP_MODE_DISABLED:
+		mode = "disabled";
+		break;
+	case BTRFS_QGROUP_MODE_FULL:
+		mode = "qgroup";
+		break;
+	case BTRFS_QGROUP_MODE_SIMPLE:
+		mode = "squota";
+		break;
+	}
+	spin_unlock(&fs_info->qgroup_lock);
+
+	return sysfs_emit(buf, "%s\n", mode);
+}
+BTRFS_ATTR(qgroups, mode, qgroup_mode_show);
+
 static ssize_t qgroup_inconsistent_show(struct kobject *qgroups_kobj,
 					struct kobj_attribute *a,
 					char *buf)
@@ -2141,6 +2166,7 @@ static struct attribute *qgroups_attrs[] = {
 	BTRFS_ATTR_PTR(qgroups, enabled),
 	BTRFS_ATTR_PTR(qgroups, inconsistent),
 	BTRFS_ATTR_PTR(qgroups, drop_subtree_threshold),
+	BTRFS_ATTR_PTR(qgroups, mode),
 	NULL
 };
 ATTRIBUTE_GROUPS(qgroups);
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 9/9] btrfs: free qgroup rsv on io failure
  2023-05-03  0:59 [PATCH RFC 0/9] btrfs: simple quotas Boris Burkov
                   ` (7 preceding siblings ...)
  2023-05-03  0:59 ` [PATCH 8/9] btrfs: expose the qgroup mode via sysfs Boris Burkov
@ 2023-05-03  0:59 ` Boris Burkov
  2023-05-05  4:13 ` [PATCH RFC 0/9] btrfs: simple quotas Anand Jain
  9 siblings, 0 replies; 23+ messages in thread
From: Boris Burkov @ 2023-05-03  0:59 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

If we do a write whose bio suffers an error, we will never reclaim the
qgroup reserved space for it. We allocate the space in the write_iter
codepath, then release the reservation as we allocate the ordered
extent, but we only create a delayed ref if the ordered extent finishes.
If it has an error, we simply leak the rsv. This is apparent in running
any error injecting (dmerror) fstests like btrfs/146 or btrfs/160. Such
tests fail due to dmesg on umount complaining about the leaked qgroup
data space.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/ordered-data.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index a9778a91511e..e6803587a13c 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -426,8 +426,12 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
 			entry->bytes_left -= len;
 		}
 
-		if (!uptodate)
+		if (!uptodate) {
 			set_bit(BTRFS_ORDERED_IOERR, &entry->flags);
+			btrfs_qgroup_free_refroot(fs_info, inode->root->root_key.objectid,
+						  entry->qgroup_rsv, BTRFS_QGROUP_RSV_DATA);
+			entry->qgroup_rsv = 0;
+		}
 
 		/*
 		 * All the IO of the ordered extent is finished, we need to queue
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/9] btrfs: simple quotas mode
  2023-05-03  0:59 ` [PATCH 1/9] btrfs: simple quotas mode Boris Burkov
@ 2023-05-03  2:38   ` Qu Wenruo
  0 siblings, 0 replies; 23+ messages in thread
From: Qu Wenruo @ 2023-05-03  2:38 UTC (permalink / raw)
  To: Boris Burkov, linux-btrfs, kernel-team



On 2023/5/3 08:59, Boris Burkov wrote:
> Allow the quota enable ioctl to specify simple quotas. Set an INCOMPAT
> bit when simple quotas are enabled, as it will result in several
> breaking changes to the on-disk format.

No need for incompat, if you put all the simple quota things into a 
dedicated tree.

And it's no longer a recommended behavior to enable a new feature 
through ioctl/mount option anymore.

New features should be enabled or converted through mkfs or btrfstune 
instead.

Thanks,
Qu

> 
> Introduce an enum for capturing the current quota mode. Rather than just
> enabled/disabled, the possible settings are now disabled/simple/full.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>   fs/btrfs/fs.h              |  5 ++-
>   fs/btrfs/ioctl.c           |  2 +-
>   fs/btrfs/qgroup.c          | 68 ++++++++++++++++++++++++++++++--------
>   fs/btrfs/qgroup.h          | 10 +++++-
>   fs/btrfs/transaction.c     |  4 +--
>   include/uapi/linux/btrfs.h |  1 +
>   6 files changed, 69 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index 0d98fc5f6f44..6c989d87768c 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -218,7 +218,8 @@ enum {
>   	 BTRFS_FEATURE_INCOMPAT_NO_HOLES	|	\
>   	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID	|	\
>   	 BTRFS_FEATURE_INCOMPAT_RAID1C34	|	\
> -	 BTRFS_FEATURE_INCOMPAT_ZONED)
> +	 BTRFS_FEATURE_INCOMPAT_ZONED		|	\
> +	 BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA)
>   
>   #ifdef CONFIG_BTRFS_DEBUG
>   	/*
> @@ -233,7 +234,6 @@ enum {
>   
>   #define BTRFS_FEATURE_INCOMPAT_SUPP		\
>   	(BTRFS_FEATURE_INCOMPAT_SUPP_STABLE)
> -
>   #endif
>   
>   #define BTRFS_FEATURE_INCOMPAT_SAFE_SET			\
> @@ -791,7 +791,6 @@ struct btrfs_fs_info {
>   	struct lockdep_map btrfs_state_change_map[4];
>   	struct lockdep_map btrfs_trans_pending_ordered_map;
>   	struct lockdep_map btrfs_ordered_extent_map;
> -
>   #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>   	spinlock_t ref_verify_lock;
>   	struct rb_root block_tree;
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 9522669000a7..ca7d2ef739c8 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -3685,7 +3685,7 @@ static long btrfs_ioctl_quota_ctl(struct file *file, void __user *arg)
>   
>   	switch (sa->cmd) {
>   	case BTRFS_QUOTA_CTL_ENABLE:
> -		ret = btrfs_quota_enable(fs_info);
> +		ret = btrfs_quota_enable(fs_info, sa);
>   		break;
>   	case BTRFS_QUOTA_CTL_DISABLE:
>   		ret = btrfs_quota_disable(fs_info);
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index f41da7ac360d..3c8b296215ee 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -3,6 +3,7 @@
>    * Copyright (C) 2011 STRATO.  All rights reserved.
>    */
>   
> +#include <linux/btrfs.h>
>   #include <linux/sched.h>
>   #include <linux/pagemap.h>
>   #include <linux/writeback.h>
> @@ -10,7 +11,6 @@
>   #include <linux/rbtree.h>
>   #include <linux/slab.h>
>   #include <linux/workqueue.h>
> -#include <linux/btrfs.h>
>   #include <linux/sched/mm.h>
>   
>   #include "ctree.h"
> @@ -30,6 +30,15 @@
>   #include "root-tree.h"
>   #include "tree-checker.h"
>   
> +enum btrfs_qgroup_mode btrfs_qgroup_mode(struct btrfs_fs_info *fs_info)
> +{
> +	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
> +		return BTRFS_QGROUP_MODE_DISABLED;
> +	if (btrfs_fs_incompat(fs_info, SIMPLE_QUOTA))
> +		return BTRFS_QGROUP_MODE_SIMPLE;
> +	return BTRFS_QGROUP_MODE_FULL;
> +}
> +
>   /*
>    * Helpers to access qgroup reservation
>    *
> @@ -340,6 +349,8 @@ int btrfs_verify_qgroup_counts(struct btrfs_fs_info *fs_info, u64 qgroupid,
>   
>   static void qgroup_mark_inconsistent(struct btrfs_fs_info *fs_info)
>   {
> +	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
> +		return;
>   	fs_info->qgroup_flags |= (BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT |
>   				  BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCAN |
>   				  BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING);
> @@ -412,7 +423,8 @@ int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info)
>   				goto out;
>   			}
>   			if (btrfs_qgroup_status_generation(l, ptr) !=
> -			    fs_info->generation) {
> +			    fs_info->generation &&
> +			    !btrfs_fs_incompat(fs_info, SIMPLE_QUOTA)) {
>   				qgroup_mark_inconsistent(fs_info);
>   				btrfs_err(fs_info,
>   					"qgroup generation mismatch, marked as inconsistent");
> @@ -949,7 +961,8 @@ static int btrfs_clean_quota_tree(struct btrfs_trans_handle *trans,
>   	return ret;
>   }
>   
> -int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
> +int btrfs_quota_enable(struct btrfs_fs_info *fs_info,
> +		       struct btrfs_ioctl_quota_ctl_args *quota_ctl_args)
>   {
>   	struct btrfs_root *quota_root;
>   	struct btrfs_root *tree_root = fs_info->tree_root;
> @@ -961,6 +974,7 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
>   	struct btrfs_qgroup *qgroup = NULL;
>   	struct btrfs_trans_handle *trans = NULL;
>   	struct ulist *ulist = NULL;
> +	bool simple_qgroups = quota_ctl_args->status == 42;
>   	int ret = 0;
>   	int slot;
>   
> @@ -1063,8 +1077,9 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
>   				 struct btrfs_qgroup_status_item);
>   	btrfs_set_qgroup_status_generation(leaf, ptr, trans->transid);
>   	btrfs_set_qgroup_status_version(leaf, ptr, BTRFS_QGROUP_STATUS_VERSION);
> -	fs_info->qgroup_flags = BTRFS_QGROUP_STATUS_FLAG_ON |
> -				BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
> +	fs_info->qgroup_flags = BTRFS_QGROUP_STATUS_FLAG_ON;
> +	if (!simple_qgroups)
> +		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
>   	btrfs_set_qgroup_status_flags(leaf, ptr, fs_info->qgroup_flags &
>   				      BTRFS_QGROUP_STATUS_FLAGS_MASK);
>   	btrfs_set_qgroup_status_rescan(leaf, ptr, 0);
> @@ -1180,8 +1195,14 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
>   	spin_lock(&fs_info->qgroup_lock);
>   	fs_info->quota_root = quota_root;
>   	set_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);
> +	if (simple_qgroups)
> +		btrfs_set_fs_incompat(fs_info, SIMPLE_QUOTA);
>   	spin_unlock(&fs_info->qgroup_lock);
>   
> +	/* Skip rescan for simple qgroups */
> +	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
> +		goto out_free_path;
> +
>   	ret = qgroup_rescan_init(fs_info, 0, 1);
>   	if (!ret) {
>   	        qgroup_rescan_zero_tracking(fs_info);
> @@ -1766,6 +1787,9 @@ int btrfs_qgroup_trace_extent_nolock(struct btrfs_fs_info *fs_info,
>   	struct btrfs_qgroup_extent_record *entry;
>   	u64 bytenr = record->bytenr;
>   
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
> +		return 0;
> +
>   	lockdep_assert_held(&delayed_refs->lock);
>   	trace_btrfs_qgroup_trace_extent(fs_info, record);
>   
> @@ -1798,6 +1822,8 @@ int btrfs_qgroup_trace_extent_post(struct btrfs_trans_handle *trans,
>   	struct btrfs_backref_walk_ctx ctx = { 0 };
>   	int ret;
>   
> +	if (btrfs_qgroup_mode(trans->fs_info) != BTRFS_QGROUP_MODE_FULL)
> +		return 0;
>   	/*
>   	 * We are always called in a context where we are already holding a
>   	 * transaction handle. Often we are called when adding a data delayed
> @@ -1853,7 +1879,7 @@ int btrfs_qgroup_trace_extent(struct btrfs_trans_handle *trans, u64 bytenr,
>   	struct btrfs_delayed_ref_root *delayed_refs;
>   	int ret;
>   
> -	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL
>   	    || bytenr == 0 || num_bytes == 0)
>   		return 0;
>   	record = kzalloc(sizeof(*record), GFP_NOFS);
> @@ -1886,7 +1912,7 @@ int btrfs_qgroup_trace_leaf_items(struct btrfs_trans_handle *trans,
>   	u64 bytenr, num_bytes;
>   
>   	/* We can be called directly from walk_up_proc() */
> -	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
>   		return 0;
>   
>   	for (i = 0; i < nr; i++) {
> @@ -2262,7 +2288,7 @@ static int qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
>   	int level;
>   	int ret;
>   
> -	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
>   		return 0;
>   
>   	/* Wrong parameter order */
> @@ -2319,7 +2345,7 @@ int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans,
>   	BUG_ON(root_level < 0 || root_level >= BTRFS_MAX_LEVEL);
>   	BUG_ON(root_eb == NULL);
>   
> -	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
>   		return 0;
>   
>   	spin_lock(&fs_info->qgroup_lock);
> @@ -2659,7 +2685,7 @@ int btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans, u64 bytenr,
>   	 * If quotas get disabled meanwhile, the resources need to be freed and
>   	 * we can't just exit here.
>   	 */
> -	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) ||
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL ||
>   	    fs_info->qgroup_flags & BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING)
>   		goto out_free;
>   
> @@ -2747,6 +2773,9 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans)
>   	u64 qgroup_to_skip;
>   	int ret = 0;
>   
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
> +		return 0;
> +
>   	delayed_refs = &trans->transaction->delayed_refs;
>   	qgroup_to_skip = delayed_refs->qgroup_to_skip;
>   	while ((node = rb_first(&delayed_refs->dirty_extent_root))) {
> @@ -2989,11 +3018,10 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
>   		qgroup_dirty(fs_info, dstgroup);
>   	}
>   
> -	if (srcid) {
> +	if (srcid && btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_FULL) {
>   		srcgroup = find_qgroup_rb(fs_info, srcid);
>   		if (!srcgroup)
>   			goto unlock;
> -
>   		/*
>   		 * We call inherit after we clone the root in order to make sure
>   		 * our counts don't go crazy, so at this point the only
> @@ -3281,6 +3309,9 @@ static int qgroup_rescan_leaf(struct btrfs_trans_handle *trans,
>   	int slot;
>   	int ret;
>   
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
> +		return 1;
> +
>   	mutex_lock(&fs_info->qgroup_rescan_lock);
>   	extent_root = btrfs_extent_root(fs_info,
>   				fs_info->qgroup_rescan_progress.objectid);
> @@ -3378,6 +3409,9 @@ static void btrfs_qgroup_rescan_worker(struct btrfs_work *work)
>   	bool stopped = false;
>   	bool did_leaf_rescans = false;
>   
> +	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
> +		return;
> +
>   	path = btrfs_alloc_path();
>   	if (!path)
>   		goto out;
> @@ -3481,6 +3515,12 @@ qgroup_rescan_init(struct btrfs_fs_info *fs_info, u64 progress_objectid,
>   {
>   	int ret = 0;
>   
> +	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE) {
> +		btrfs_warn(fs_info, "qgroup rescan init failed, running in simple mode. mode: %d\n",
> +			btrfs_qgroup_mode(fs_info));
> +		return -EINVAL;
> +	}
> +
>   	if (!init_flags) {
>   		/* we're resuming qgroup rescan at mount time */
>   		if (!(fs_info->qgroup_flags &
> @@ -4240,7 +4280,7 @@ int btrfs_qgroup_add_swapped_blocks(struct btrfs_trans_handle *trans,
>   	int level = btrfs_header_level(subvol_parent) - 1;
>   	int ret = 0;
>   
> -	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
>   		return 0;
>   
>   	if (btrfs_node_ptr_generation(subvol_parent, subvol_slot) >
> @@ -4350,7 +4390,7 @@ int btrfs_qgroup_trace_subtree_after_cow(struct btrfs_trans_handle *trans,
>   	int ret = 0;
>   	int i;
>   
> -	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
>   		return 0;
>   	if (!is_fstree(root->root_key.objectid) || !root->reloc_root)
>   		return 0;
> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
> index 7bffa10589d6..d4c4d039585f 100644
> --- a/fs/btrfs/qgroup.h
> +++ b/fs/btrfs/qgroup.h
> @@ -249,7 +249,15 @@ enum {
>   	ENUM_BIT(QGROUP_FREE),
>   };
>   
> -int btrfs_quota_enable(struct btrfs_fs_info *fs_info);
> +enum btrfs_qgroup_mode {
> +	BTRFS_QGROUP_MODE_DISABLED,
> +	BTRFS_QGROUP_MODE_FULL,
> +	BTRFS_QGROUP_MODE_SIMPLE
> +};
> +
> +enum btrfs_qgroup_mode btrfs_qgroup_mode(struct btrfs_fs_info *fs_info);
> +int btrfs_quota_enable(struct btrfs_fs_info *fs_info,
> +		       struct btrfs_ioctl_quota_ctl_args *quota_ctl_args);
>   int btrfs_quota_disable(struct btrfs_fs_info *fs_info);
>   int btrfs_qgroup_rescan(struct btrfs_fs_info *fs_info);
>   void btrfs_qgroup_rescan_resume(struct btrfs_fs_info *fs_info);
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 8b6a99b8d7f6..e6d6752c2fca 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1514,11 +1514,11 @@ static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
>   	int ret;
>   
>   	/*
> -	 * Save some performance in the case that qgroups are not
> +	 * Save some performance in the case that full qgroups are not
>   	 * enabled. If this check races with the ioctl, rescan will
>   	 * kick in anyway.
>   	 */
> -	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
>   		return 0;
>   
>   	/*
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index dbb8b96da50d..957ca4037974 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -333,6 +333,7 @@ struct btrfs_ioctl_fs_info_args {
>   #define BTRFS_FEATURE_INCOMPAT_RAID1C34		(1ULL << 11)
>   #define BTRFS_FEATURE_INCOMPAT_ZONED		(1ULL << 12)
>   #define BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2	(1ULL << 13)
> +#define BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA	(1ULL << 14)
>   
>   struct btrfs_ioctl_feature_flags {
>   	__u64 compat_flags;

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/9] btrfs: track original extent subvol in a new inline ref
  2023-05-03  0:59 ` [PATCH 3/9] btrfs: track original extent subvol in a new inline ref Boris Burkov
@ 2023-05-03  3:17   ` Qu Wenruo
  2023-05-04 16:17     ` Boris Burkov
  2023-05-10 16:57     ` David Sterba
  0 siblings, 2 replies; 23+ messages in thread
From: Qu Wenruo @ 2023-05-03  3:17 UTC (permalink / raw)
  To: Boris Burkov, linux-btrfs, kernel-team



On 2023/5/3 08:59, Boris Burkov wrote:
> In order to implement simple quota groups, we need to be able to
> associate a data extent with the subvolume that created it. Once you
> account for reflink, this information cannot be recovered without
> explicitly storing it. Options for storing it are:
> - a new key/item
> - a new extent inline ref item
> 
> The former is backwards compatible, but wastes space, the latter is
> incompat, but is efficient in space and reuses the existing inline ref
> machinery, while only abusing it a tiny amount -- specifically, the new
> item is not a ref, per-se.

Even we introduce new extent tree items, we can still mark the fs compat_ro.

As long as we don't do any writes, we can still read the fs without any 
compatibility problem, and the enable/disable should be addressed by 
btrfstune/mkfs anyway.

Thanks,
Qu
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>   fs/btrfs/accessors.h            |  4 +++
>   fs/btrfs/backref.c              |  3 ++
>   fs/btrfs/extent-tree.c          | 50 +++++++++++++++++++++++++--------
>   fs/btrfs/print-tree.c           | 12 ++++++++
>   fs/btrfs/ref-verify.c           |  3 ++
>   fs/btrfs/tree-checker.c         |  3 ++
>   include/uapi/linux/btrfs_tree.h |  6 ++++
>   7 files changed, 70 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
> index ceadfc5d6c66..aab61312e4e8 100644
> --- a/fs/btrfs/accessors.h
> +++ b/fs/btrfs/accessors.h
> @@ -350,6 +350,8 @@ BTRFS_SETGET_FUNCS(extent_data_ref_count, struct btrfs_extent_data_ref, count, 3
>   
>   BTRFS_SETGET_FUNCS(shared_data_ref_count, struct btrfs_shared_data_ref, count, 32);
>   
> +BTRFS_SETGET_FUNCS(extent_owner_ref_root_id, struct btrfs_extent_owner_ref, root_id, 64);
> +
>   BTRFS_SETGET_FUNCS(extent_inline_ref_type, struct btrfs_extent_inline_ref,
>   		   type, 8);
>   BTRFS_SETGET_FUNCS(extent_inline_ref_offset, struct btrfs_extent_inline_ref,
> @@ -366,6 +368,8 @@ static inline u32 btrfs_extent_inline_ref_size(int type)
>   	if (type == BTRFS_EXTENT_DATA_REF_KEY)
>   		return sizeof(struct btrfs_extent_data_ref) +
>   		       offsetof(struct btrfs_extent_inline_ref, offset);
> +	if (type == BTRFS_EXTENT_OWNER_REF_KEY)
> +		return sizeof(struct btrfs_extent_inline_ref);
>   	return 0;
>   }
>   
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index e54f0884802a..8cd8ed6c572f 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -1128,6 +1128,9 @@ static int add_inline_refs(struct btrfs_backref_walk_ctx *ctx,
>   						       count, sc, GFP_NOFS);
>   			break;
>   		}
> +		case BTRFS_EXTENT_OWNER_REF_KEY:
> +			WARN_ON(!btrfs_fs_incompat(ctx->fs_info, SIMPLE_QUOTA));
> +			break;
>   		default:
>   			WARN_ON(1);
>   		}
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 5cd289de4e92..b9a2f1e355b7 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -363,9 +363,13 @@ int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
>   				     struct btrfs_extent_inline_ref *iref,
>   				     enum btrfs_inline_ref_type is_data)
>   {
> +	struct btrfs_fs_info *fs_info = eb->fs_info;
>   	int type = btrfs_extent_inline_ref_type(eb, iref);
>   	u64 offset = btrfs_extent_inline_ref_offset(eb, iref);
>   
> +	if (type == BTRFS_EXTENT_OWNER_REF_KEY && btrfs_fs_incompat(fs_info, SIMPLE_QUOTA))
> +		return type;
> +
>   	if (type == BTRFS_TREE_BLOCK_REF_KEY ||
>   	    type == BTRFS_SHARED_BLOCK_REF_KEY ||
>   	    type == BTRFS_SHARED_DATA_REF_KEY ||
> @@ -374,26 +378,25 @@ int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
>   			if (type == BTRFS_TREE_BLOCK_REF_KEY)
>   				return type;
>   			if (type == BTRFS_SHARED_BLOCK_REF_KEY) {
> -				ASSERT(eb->fs_info);
> +				ASSERT(fs_info);
>   				/*
>   				 * Every shared one has parent tree block,
>   				 * which must be aligned to sector size.
>   				 */
> -				if (offset &&
> -				    IS_ALIGNED(offset, eb->fs_info->sectorsize))
> +				if (offset && IS_ALIGNED(offset, fs_info->sectorsize))
>   					return type;
>   			}
>   		} else if (is_data == BTRFS_REF_TYPE_DATA) {
>   			if (type == BTRFS_EXTENT_DATA_REF_KEY)
>   				return type;
>   			if (type == BTRFS_SHARED_DATA_REF_KEY) {
> -				ASSERT(eb->fs_info);
> +				ASSERT(fs_info);
>   				/*
>   				 * Every shared one has parent tree block,
>   				 * which must be aligned to sector size.
>   				 */
>   				if (offset &&
> -				    IS_ALIGNED(offset, eb->fs_info->sectorsize))
> +				    IS_ALIGNED(offset, fs_info->sectorsize))
>   					return type;
>   			}
>   		} else {
> @@ -403,7 +406,7 @@ int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
>   	}
>   
>   	btrfs_print_leaf((struct extent_buffer *)eb);
> -	btrfs_err(eb->fs_info,
> +	btrfs_err(fs_info,
>   		  "eb %llu iref 0x%lx invalid extent inline ref type %d",
>   		  eb->start, (unsigned long)iref, type);
>   	WARN_ON(1);
> @@ -912,6 +915,11 @@ int lookup_inline_extent_backref(struct btrfs_trans_handle *trans,
>   		}
>   		iref = (struct btrfs_extent_inline_ref *)ptr;
>   		type = btrfs_get_extent_inline_ref_type(leaf, iref, needed);
> +		if (type == BTRFS_EXTENT_OWNER_REF_KEY) {
> +			WARN_ON(!btrfs_fs_incompat(fs_info, SIMPLE_QUOTA));
> +			ptr += btrfs_extent_inline_ref_size(type);
> +			continue;
> +		}
>   		if (type == BTRFS_REF_TYPE_INVALID) {
>   			err = -EUCLEAN;
>   			goto out;
> @@ -1708,6 +1716,8 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
>   		 node->type == BTRFS_SHARED_DATA_REF_KEY)
>   		ret = run_delayed_data_ref(trans, node, extent_op,
>   					   insert_reserved);
> +	else if (node->type == BTRFS_EXTENT_OWNER_REF_KEY)
> +		ret = 0;
>   	else
>   		BUG();
>   	if (ret && insert_reserved)
> @@ -2275,6 +2285,7 @@ static noinline int check_committed_ref(struct btrfs_root *root,
>   	struct btrfs_extent_item *ei;
>   	struct btrfs_key key;
>   	u32 item_size;
> +	u32 expected_size;
>   	int type;
>   	int ret;
>   
> @@ -2301,10 +2312,17 @@ static noinline int check_committed_ref(struct btrfs_root *root,
>   	ret = 1;
>   	item_size = btrfs_item_size(leaf, path->slots[0]);
>   	ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item);
> +	expected_size = sizeof(*ei) + btrfs_extent_inline_ref_size(BTRFS_EXTENT_DATA_REF_KEY);
> +
> +	iref = (struct btrfs_extent_inline_ref *)(ei + 1);
> +	type = btrfs_get_extent_inline_ref_type(leaf, iref, BTRFS_REF_TYPE_DATA);
> +	if (btrfs_fs_incompat(fs_info, SIMPLE_QUOTA) && type == BTRFS_EXTENT_OWNER_REF_KEY) {
> +		expected_size += btrfs_extent_inline_ref_size(BTRFS_EXTENT_OWNER_REF_KEY);
> +		iref = (struct btrfs_extent_inline_ref *)(iref + 1);
> +	}
>   
>   	/* If extent item has more than 1 inline ref then it's shared */
> -	if (item_size != sizeof(*ei) +
> -	    btrfs_extent_inline_ref_size(BTRFS_EXTENT_DATA_REF_KEY))
> +	if (item_size != expected_size)
>   		goto out;
>   
>   	/*
> @@ -2316,8 +2334,6 @@ static noinline int check_committed_ref(struct btrfs_root *root,
>   	     btrfs_root_last_snapshot(&root->root_item)))
>   		goto out;
>   
> -	iref = (struct btrfs_extent_inline_ref *)(ei + 1);
> -
>   	/* If this extent has SHARED_DATA_REF then it's shared */
>   	type = btrfs_get_extent_inline_ref_type(leaf, iref, BTRFS_REF_TYPE_DATA);
>   	if (type != BTRFS_EXTENT_DATA_REF_KEY)
> @@ -4572,6 +4588,7 @@ static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
>   	struct btrfs_root *extent_root;
>   	int ret;
>   	struct btrfs_extent_item *extent_item;
> +	struct btrfs_extent_owner_ref *oref;
>   	struct btrfs_extent_inline_ref *iref;
>   	struct btrfs_path *path;
>   	struct extent_buffer *leaf;
> @@ -4583,7 +4600,10 @@ static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
>   	else
>   		type = BTRFS_EXTENT_DATA_REF_KEY;
>   
> -	size = sizeof(*extent_item) + btrfs_extent_inline_ref_size(type);
> +	size = sizeof(*extent_item);
> +	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
> +		size += btrfs_extent_inline_ref_size(BTRFS_EXTENT_OWNER_REF_KEY);
> +	size += btrfs_extent_inline_ref_size(type);
>   
>   	path = btrfs_alloc_path();
>   	if (!path)
> @@ -4604,8 +4624,16 @@ static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
>   	btrfs_set_extent_flags(leaf, extent_item,
>   			       flags | BTRFS_EXTENT_FLAG_DATA);
>   
> +
>   	iref = (struct btrfs_extent_inline_ref *)(extent_item + 1);
> +	if (btrfs_fs_incompat(fs_info, SIMPLE_QUOTA)) {
> +		btrfs_set_extent_inline_ref_type(leaf, iref, BTRFS_EXTENT_OWNER_REF_KEY);
> +		oref = (struct btrfs_extent_owner_ref *)(&iref->offset);
> +		btrfs_set_extent_owner_ref_root_id(leaf, oref, root_objectid);
> +		iref = (struct btrfs_extent_inline_ref *)(oref + 1);
> +	}
>   	btrfs_set_extent_inline_ref_type(leaf, iref, type);
> +
>   	if (parent > 0) {
>   		struct btrfs_shared_data_ref *ref;
>   		ref = (struct btrfs_shared_data_ref *)(iref + 1);
> diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c
> index b93c96213304..1114cd915bd8 100644
> --- a/fs/btrfs/print-tree.c
> +++ b/fs/btrfs/print-tree.c
> @@ -80,12 +80,20 @@ static void print_extent_data_ref(struct extent_buffer *eb,
>   	       btrfs_extent_data_ref_count(eb, ref));
>   }
>   
> +static void print_extent_owner_ref(struct extent_buffer *eb,
> +				   struct btrfs_extent_owner_ref *ref)
> +{
> +	WARN_ON(!btrfs_fs_incompat(eb->fs_info, SIMPLE_QUOTA));
> +	pr_cont("extent data owner root %llu\n", btrfs_extent_owner_ref_root_id(eb, ref));
> +}
> +
>   static void print_extent_item(struct extent_buffer *eb, int slot, int type)
>   {
>   	struct btrfs_extent_item *ei;
>   	struct btrfs_extent_inline_ref *iref;
>   	struct btrfs_extent_data_ref *dref;
>   	struct btrfs_shared_data_ref *sref;
> +	struct btrfs_extent_owner_ref *oref;
>   	struct btrfs_disk_key key;
>   	unsigned long end;
>   	unsigned long ptr;
> @@ -159,6 +167,10 @@ static void print_extent_item(struct extent_buffer *eb, int slot, int type)
>   			"\t\t\t(parent %llu not aligned to sectorsize %u)\n",
>   				     offset, eb->fs_info->sectorsize);
>   			break;
> +		case BTRFS_EXTENT_OWNER_REF_KEY:
> +			oref = (struct btrfs_extent_owner_ref *)(&iref->offset);
> +			print_extent_owner_ref(eb, oref);
> +			break;
>   		default:
>   			pr_cont("(extent %llu has INVALID ref type %d)\n",
>   				  eb->start, type);
> diff --git a/fs/btrfs/ref-verify.c b/fs/btrfs/ref-verify.c
> index 95d28497de7c..9edc87eaff1f 100644
> --- a/fs/btrfs/ref-verify.c
> +++ b/fs/btrfs/ref-verify.c
> @@ -485,6 +485,9 @@ static int process_extent_item(struct btrfs_fs_info *fs_info,
>   			ret = add_shared_data_ref(fs_info, offset, count,
>   						  key->objectid, key->offset);
>   			break;
> +		case BTRFS_EXTENT_OWNER_REF_KEY:
> +			WARN_ON(!btrfs_fs_incompat(fs_info, SIMPLE_QUOTA));
> +			break;
>   		default:
>   			btrfs_err(fs_info, "invalid key type in iref");
>   			ret = -EINVAL;
> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> index e2b54793bf0c..27d4230a38a8 100644
> --- a/fs/btrfs/tree-checker.c
> +++ b/fs/btrfs/tree-checker.c
> @@ -1451,6 +1451,9 @@ static int check_extent_item(struct extent_buffer *leaf,
>   			}
>   			inline_refs += btrfs_shared_data_ref_count(leaf, sref);
>   			break;
> +		case BTRFS_EXTENT_OWNER_REF_KEY:
> +			WARN_ON(!btrfs_fs_incompat(fs_info, SIMPLE_QUOTA));
> +			break;
>   		default:
>   			extent_err(leaf, slot, "unknown inline ref type: %u",
>   				   inline_type);
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index ab38d0f411fa..424c7f342712 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -226,6 +226,8 @@
>   
>   #define BTRFS_SHARED_DATA_REF_KEY	184
>   
> +#define BTRFS_EXTENT_OWNER_REF_KEY	190
> +
>   /*
>    * block groups give us hints into the extent allocation trees.  Which
>    * blocks are free etc etc
> @@ -783,6 +785,10 @@ struct btrfs_shared_data_ref {
>   	__le32 count;
>   } __attribute__ ((__packed__));
>   
> +struct btrfs_extent_owner_ref {
> +	u64 root_id;
> +} __attribute__ ((__packed__));
> +
>   struct btrfs_extent_inline_ref {
>   	__u8 type;
>   	__le64 offset;

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 6/9] btrfs: auto hierarchy for simple qgroups of nested subvols
  2023-05-03  0:59 ` [PATCH 6/9] btrfs: auto hierarchy for simple qgroups of nested subvols Boris Burkov
@ 2023-05-03  4:47   ` Qu Wenruo
  2023-05-04 16:34     ` Boris Burkov
  0 siblings, 1 reply; 23+ messages in thread
From: Qu Wenruo @ 2023-05-03  4:47 UTC (permalink / raw)
  To: Boris Burkov, linux-btrfs, kernel-team



On 2023/5/3 08:59, Boris Burkov wrote:
> Consider the following sequence:
> - enable quotas
> - create subvol S id 256 at dir outer/
> - create a qgroup 1/100
> - add 0/256 (S's auto qgroup) to 1/100
> - create subvol T id 257 at dir outer/inner/
> 
> With full qgroups, there is no relationship between 0/257 and either of
> 0/256 or 1/100. There is an inherit feature that the creator of inner/
> can use to specify it ought to be in 1/100.
> 
> Simple quotas are targeted at container isolation, where such automatic
> inheritance for not necessarily trusted/controlled nested subvol
> creation would be quite helpful. Therefore, add a new default behavior
> for simple quotas: when you create a nested subvol, automatically
> inherit as parents any parents of the qgroup of the subvol the new inode
> is going in.
> 
> In our example, 257/0 would also be under 1/100, allowing easy control
> of a total quota over an arbitrary hierarchy of subvolumes.
> 
> I think this _might_ be a generally useful behavior, so it could be
> interesting to put it behind a new inheritance flag that simple quotas
> always use while traditional quotas let the user specify, but this is a
> minimally intrusive change to start.

I'm not sure if the automatically qgroup inherit is a good idea.

Consider the following case, a subvolume is created under another 
subvolume, then it got inherited into qgroup 1/0.

But don't forget that a subvolume can be easily moved to other 
directory, should we also change which qgroup it belongs to?


Another point is, if a container manager tool wants to do the same 
inherit behavior, it's not that hard to query the owner group of the 
parent directory, then pass "-i" option for "btrfs subvolume create" or 
"btrfs subvolume snapshot" commands.

As the old saying goes, kernel should only provide a mechanism, not a 
policy. To me, automatically inherit qgroup owner looks more like a policy.

Thanks,
Qu

> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>   fs/btrfs/ioctl.c       |  2 +-
>   fs/btrfs/qgroup.c      | 46 +++++++++++++++++++++++++++++++++++++++---
>   fs/btrfs/qgroup.h      |  6 +++---
>   fs/btrfs/transaction.c |  2 +-
>   4 files changed, 48 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index ca7d2ef739c8..4d6d28feb5c6 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -652,7 +652,7 @@ static noinline int create_subvol(struct mnt_idmap *idmap,
>   	/* Tree log can't currently deal with an inode which is a new root. */
>   	btrfs_set_log_full_commit(trans);
>   
> -	ret = btrfs_qgroup_inherit(trans, 0, objectid, inherit);
> +	ret = btrfs_qgroup_inherit(trans, 0, objectid, root->root_key.objectid, inherit);
>   	if (ret)
>   		goto out;
>   
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index e3d0630fef0c..6816e01f00b5 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -1504,8 +1504,7 @@ static int quick_update_accounting(struct btrfs_fs_info *fs_info,
>   	return ret;
>   }
>   
> -int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src,
> -			      u64 dst)
> +int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src, u64 dst)
>   {
>   	struct btrfs_fs_info *fs_info = trans->fs_info;
>   	struct btrfs_qgroup *parent;
> @@ -2945,6 +2944,42 @@ int btrfs_run_qgroups(struct btrfs_trans_handle *trans)
>   	return ret;
>   }
>   
> +static int qgroup_auto_inherit(struct btrfs_fs_info *fs_info,
> +			       u64 inode_rootid,
> +			       struct btrfs_qgroup_inherit **inherit)
> +{
> +	int i = 0;
> +	u64 num_qgroups = 0;
> +	struct btrfs_qgroup *inode_qg;
> +	struct btrfs_qgroup_list *qg_list;
> +
> +	if (*inherit)
> +		return -EEXIST;
> +
> +	inode_qg = find_qgroup_rb(fs_info, inode_rootid);
> +	if (!inode_qg)
> +		return -ENOENT;
> +
> +	list_for_each_entry(qg_list, &inode_qg->groups, next_group) {
> +		++num_qgroups;
> +	}
> +
> +	if (!num_qgroups)
> +		return 0;
> +
> +	*inherit = kzalloc(sizeof(**inherit) + num_qgroups * sizeof(u64), GFP_NOFS);
> +	if (!*inherit)
> +		return -ENOMEM;
> +	(*inherit)->num_qgroups = num_qgroups;
> +
> +	list_for_each_entry(qg_list, &inode_qg->groups, next_group) {
> +		u64 qg_id = qg_list->group->qgroupid;
> +		*((u64 *)((*inherit)+1) + i) = qg_id;
> +	}
> +
> +	return 0;
> +}
> +
>   /*
>    * Copy the accounting information between qgroups. This is necessary
>    * when a snapshot or a subvolume is created. Throwing an error will
> @@ -2952,7 +2987,8 @@ int btrfs_run_qgroups(struct btrfs_trans_handle *trans)
>    * when a readonly fs is a reasonable outcome.
>    */
>   int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
> -			 u64 objectid, struct btrfs_qgroup_inherit *inherit)
> +			 u64 objectid, u64 inode_rootid,
> +			 struct btrfs_qgroup_inherit *inherit)
>   {
>   	int ret = 0;
>   	int i;
> @@ -2994,6 +3030,9 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
>   		goto out;
>   	}
>   
> +	if (!inherit && btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
> +		qgroup_auto_inherit(fs_info, inode_rootid, &inherit);
> +
>   	if (inherit) {
>   		i_qgroups = (u64 *)(inherit + 1);
>   		nums = inherit->num_qgroups + 2 * inherit->num_ref_copies +
> @@ -3020,6 +3059,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
>   	if (ret)
>   		goto out;
>   
> +
>   	/*
>   	 * add qgroup to all inherited groups
>   	 */
> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
> index b300998dcbc7..aecebe9d0d62 100644
> --- a/fs/btrfs/qgroup.h
> +++ b/fs/btrfs/qgroup.h
> @@ -272,8 +272,7 @@ int btrfs_qgroup_rescan(struct btrfs_fs_info *fs_info);
>   void btrfs_qgroup_rescan_resume(struct btrfs_fs_info *fs_info);
>   int btrfs_qgroup_wait_for_completion(struct btrfs_fs_info *fs_info,
>   				     bool interruptible);
> -int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src,
> -			      u64 dst);
> +int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src, u64 dst);
>   int btrfs_del_qgroup_relation(struct btrfs_trans_handle *trans, u64 src,
>   			      u64 dst);
>   int btrfs_create_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid);
> @@ -367,7 +366,8 @@ int btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans, u64 bytenr,
>   int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans);
>   int btrfs_run_qgroups(struct btrfs_trans_handle *trans);
>   int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
> -			 u64 objectid, struct btrfs_qgroup_inherit *inherit);
> +			 u64 objectid, u64 inode_rootid,
> +			 struct btrfs_qgroup_inherit *inherit);
>   void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info,
>   			       u64 ref_root, u64 num_bytes,
>   			       enum btrfs_qgroup_rsv_type type);
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 0edfb58afd80..6befcf1b4b1f 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1557,7 +1557,7 @@ static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
>   
>   	/* Now qgroup are all updated, we can inherit it to new qgroups */
>   	ret = btrfs_qgroup_inherit(trans, src->root_key.objectid, dst_objectid,
> -				   inherit);
> +				   parent->root_key.objectid, inherit);
>   	if (ret < 0)
>   		goto out;
>   

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/9] btrfs: track original extent subvol in a new inline ref
  2023-05-03  3:17   ` Qu Wenruo
@ 2023-05-04 16:17     ` Boris Burkov
  2023-05-04 21:49       ` Qu Wenruo
  2023-05-10 16:57     ` David Sterba
  1 sibling, 1 reply; 23+ messages in thread
From: Boris Burkov @ 2023-05-04 16:17 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, kernel-team

On Wed, May 03, 2023 at 11:17:12AM +0800, Qu Wenruo wrote:
> 
> 
> On 2023/5/3 08:59, Boris Burkov wrote:
> > In order to implement simple quota groups, we need to be able to
> > associate a data extent with the subvolume that created it. Once you
> > account for reflink, this information cannot be recovered without
> > explicitly storing it. Options for storing it are:
> > - a new key/item
> > - a new extent inline ref item
> > 
> > The former is backwards compatible, but wastes space, the latter is
> > incompat, but is efficient in space and reuses the existing inline ref
> > machinery, while only abusing it a tiny amount -- specifically, the new
> > item is not a ref, per-se.
> 
> Even we introduce new extent tree items, we can still mark the fs compat_ro.
> 
> As long as we don't do any writes, we can still read the fs without any
> compatibility problem, and the enable/disable should be addressed by
> btrfstune/mkfs anyway.

Unfortunately, I don't believe compat_ro is possible with this design.
Because of how inline ref items are implemented, there is a lot of code
that makes assumptions about the extent item size, and the inline ref
item size based on their type. The best example that definitely breaks
things rather than maybe just warning is check_extent in tree-checker.c

With a new unparseable ref item inserted in the sequence of refs, that
code will either overflow or detect padding. The size calculation comes
up 0, etc. Perhaps there is a clever way to trick it, but I have not
seen it yet.

I was able to make it compat_ro by introducing an entirely new item for
the owner ref, but that comes with a per extent disk usage tradeoff that
is fairly steep for storing just a single u64.

> 
> Thanks,
> Qu
> > 
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >   fs/btrfs/accessors.h            |  4 +++
> >   fs/btrfs/backref.c              |  3 ++
> >   fs/btrfs/extent-tree.c          | 50 +++++++++++++++++++++++++--------
> >   fs/btrfs/print-tree.c           | 12 ++++++++
> >   fs/btrfs/ref-verify.c           |  3 ++
> >   fs/btrfs/tree-checker.c         |  3 ++
> >   include/uapi/linux/btrfs_tree.h |  6 ++++
> >   7 files changed, 70 insertions(+), 11 deletions(-)
> > 
> > diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
> > index ceadfc5d6c66..aab61312e4e8 100644
> > --- a/fs/btrfs/accessors.h
> > +++ b/fs/btrfs/accessors.h
> > @@ -350,6 +350,8 @@ BTRFS_SETGET_FUNCS(extent_data_ref_count, struct btrfs_extent_data_ref, count, 3
> >   BTRFS_SETGET_FUNCS(shared_data_ref_count, struct btrfs_shared_data_ref, count, 32);
> > +BTRFS_SETGET_FUNCS(extent_owner_ref_root_id, struct btrfs_extent_owner_ref, root_id, 64);
> > +
> >   BTRFS_SETGET_FUNCS(extent_inline_ref_type, struct btrfs_extent_inline_ref,
> >   		   type, 8);
> >   BTRFS_SETGET_FUNCS(extent_inline_ref_offset, struct btrfs_extent_inline_ref,
> > @@ -366,6 +368,8 @@ static inline u32 btrfs_extent_inline_ref_size(int type)
> >   	if (type == BTRFS_EXTENT_DATA_REF_KEY)
> >   		return sizeof(struct btrfs_extent_data_ref) +
> >   		       offsetof(struct btrfs_extent_inline_ref, offset);
> > +	if (type == BTRFS_EXTENT_OWNER_REF_KEY)
> > +		return sizeof(struct btrfs_extent_inline_ref);
> >   	return 0;
> >   }
> > diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> > index e54f0884802a..8cd8ed6c572f 100644
> > --- a/fs/btrfs/backref.c
> > +++ b/fs/btrfs/backref.c
> > @@ -1128,6 +1128,9 @@ static int add_inline_refs(struct btrfs_backref_walk_ctx *ctx,
> >   						       count, sc, GFP_NOFS);
> >   			break;
> >   		}
> > +		case BTRFS_EXTENT_OWNER_REF_KEY:
> > +			WARN_ON(!btrfs_fs_incompat(ctx->fs_info, SIMPLE_QUOTA));
> > +			break;
> >   		default:
> >   			WARN_ON(1);
> >   		}
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 5cd289de4e92..b9a2f1e355b7 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -363,9 +363,13 @@ int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
> >   				     struct btrfs_extent_inline_ref *iref,
> >   				     enum btrfs_inline_ref_type is_data)
> >   {
> > +	struct btrfs_fs_info *fs_info = eb->fs_info;
> >   	int type = btrfs_extent_inline_ref_type(eb, iref);
> >   	u64 offset = btrfs_extent_inline_ref_offset(eb, iref);
> > +	if (type == BTRFS_EXTENT_OWNER_REF_KEY && btrfs_fs_incompat(fs_info, SIMPLE_QUOTA))
> > +		return type;
> > +
> >   	if (type == BTRFS_TREE_BLOCK_REF_KEY ||
> >   	    type == BTRFS_SHARED_BLOCK_REF_KEY ||
> >   	    type == BTRFS_SHARED_DATA_REF_KEY ||
> > @@ -374,26 +378,25 @@ int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
> >   			if (type == BTRFS_TREE_BLOCK_REF_KEY)
> >   				return type;
> >   			if (type == BTRFS_SHARED_BLOCK_REF_KEY) {
> > -				ASSERT(eb->fs_info);
> > +				ASSERT(fs_info);
> >   				/*
> >   				 * Every shared one has parent tree block,
> >   				 * which must be aligned to sector size.
> >   				 */
> > -				if (offset &&
> > -				    IS_ALIGNED(offset, eb->fs_info->sectorsize))
> > +				if (offset && IS_ALIGNED(offset, fs_info->sectorsize))
> >   					return type;
> >   			}
> >   		} else if (is_data == BTRFS_REF_TYPE_DATA) {
> >   			if (type == BTRFS_EXTENT_DATA_REF_KEY)
> >   				return type;
> >   			if (type == BTRFS_SHARED_DATA_REF_KEY) {
> > -				ASSERT(eb->fs_info);
> > +				ASSERT(fs_info);
> >   				/*
> >   				 * Every shared one has parent tree block,
> >   				 * which must be aligned to sector size.
> >   				 */
> >   				if (offset &&
> > -				    IS_ALIGNED(offset, eb->fs_info->sectorsize))
> > +				    IS_ALIGNED(offset, fs_info->sectorsize))
> >   					return type;
> >   			}
> >   		} else {
> > @@ -403,7 +406,7 @@ int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
> >   	}
> >   	btrfs_print_leaf((struct extent_buffer *)eb);
> > -	btrfs_err(eb->fs_info,
> > +	btrfs_err(fs_info,
> >   		  "eb %llu iref 0x%lx invalid extent inline ref type %d",
> >   		  eb->start, (unsigned long)iref, type);
> >   	WARN_ON(1);
> > @@ -912,6 +915,11 @@ int lookup_inline_extent_backref(struct btrfs_trans_handle *trans,
> >   		}
> >   		iref = (struct btrfs_extent_inline_ref *)ptr;
> >   		type = btrfs_get_extent_inline_ref_type(leaf, iref, needed);
> > +		if (type == BTRFS_EXTENT_OWNER_REF_KEY) {
> > +			WARN_ON(!btrfs_fs_incompat(fs_info, SIMPLE_QUOTA));
> > +			ptr += btrfs_extent_inline_ref_size(type);
> > +			continue;
> > +		}
> >   		if (type == BTRFS_REF_TYPE_INVALID) {
> >   			err = -EUCLEAN;
> >   			goto out;
> > @@ -1708,6 +1716,8 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
> >   		 node->type == BTRFS_SHARED_DATA_REF_KEY)
> >   		ret = run_delayed_data_ref(trans, node, extent_op,
> >   					   insert_reserved);
> > +	else if (node->type == BTRFS_EXTENT_OWNER_REF_KEY)
> > +		ret = 0;
> >   	else
> >   		BUG();
> >   	if (ret && insert_reserved)
> > @@ -2275,6 +2285,7 @@ static noinline int check_committed_ref(struct btrfs_root *root,
> >   	struct btrfs_extent_item *ei;
> >   	struct btrfs_key key;
> >   	u32 item_size;
> > +	u32 expected_size;
> >   	int type;
> >   	int ret;
> > @@ -2301,10 +2312,17 @@ static noinline int check_committed_ref(struct btrfs_root *root,
> >   	ret = 1;
> >   	item_size = btrfs_item_size(leaf, path->slots[0]);
> >   	ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item);
> > +	expected_size = sizeof(*ei) + btrfs_extent_inline_ref_size(BTRFS_EXTENT_DATA_REF_KEY);
> > +
> > +	iref = (struct btrfs_extent_inline_ref *)(ei + 1);
> > +	type = btrfs_get_extent_inline_ref_type(leaf, iref, BTRFS_REF_TYPE_DATA);
> > +	if (btrfs_fs_incompat(fs_info, SIMPLE_QUOTA) && type == BTRFS_EXTENT_OWNER_REF_KEY) {
> > +		expected_size += btrfs_extent_inline_ref_size(BTRFS_EXTENT_OWNER_REF_KEY);
> > +		iref = (struct btrfs_extent_inline_ref *)(iref + 1);
> > +	}
> >   	/* If extent item has more than 1 inline ref then it's shared */
> > -	if (item_size != sizeof(*ei) +
> > -	    btrfs_extent_inline_ref_size(BTRFS_EXTENT_DATA_REF_KEY))
> > +	if (item_size != expected_size)
> >   		goto out;
> >   	/*
> > @@ -2316,8 +2334,6 @@ static noinline int check_committed_ref(struct btrfs_root *root,
> >   	     btrfs_root_last_snapshot(&root->root_item)))
> >   		goto out;
> > -	iref = (struct btrfs_extent_inline_ref *)(ei + 1);
> > -
> >   	/* If this extent has SHARED_DATA_REF then it's shared */
> >   	type = btrfs_get_extent_inline_ref_type(leaf, iref, BTRFS_REF_TYPE_DATA);
> >   	if (type != BTRFS_EXTENT_DATA_REF_KEY)
> > @@ -4572,6 +4588,7 @@ static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
> >   	struct btrfs_root *extent_root;
> >   	int ret;
> >   	struct btrfs_extent_item *extent_item;
> > +	struct btrfs_extent_owner_ref *oref;
> >   	struct btrfs_extent_inline_ref *iref;
> >   	struct btrfs_path *path;
> >   	struct extent_buffer *leaf;
> > @@ -4583,7 +4600,10 @@ static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
> >   	else
> >   		type = BTRFS_EXTENT_DATA_REF_KEY;
> > -	size = sizeof(*extent_item) + btrfs_extent_inline_ref_size(type);
> > +	size = sizeof(*extent_item);
> > +	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
> > +		size += btrfs_extent_inline_ref_size(BTRFS_EXTENT_OWNER_REF_KEY);
> > +	size += btrfs_extent_inline_ref_size(type);
> >   	path = btrfs_alloc_path();
> >   	if (!path)
> > @@ -4604,8 +4624,16 @@ static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
> >   	btrfs_set_extent_flags(leaf, extent_item,
> >   			       flags | BTRFS_EXTENT_FLAG_DATA);
> > +
> >   	iref = (struct btrfs_extent_inline_ref *)(extent_item + 1);
> > +	if (btrfs_fs_incompat(fs_info, SIMPLE_QUOTA)) {
> > +		btrfs_set_extent_inline_ref_type(leaf, iref, BTRFS_EXTENT_OWNER_REF_KEY);
> > +		oref = (struct btrfs_extent_owner_ref *)(&iref->offset);
> > +		btrfs_set_extent_owner_ref_root_id(leaf, oref, root_objectid);
> > +		iref = (struct btrfs_extent_inline_ref *)(oref + 1);
> > +	}
> >   	btrfs_set_extent_inline_ref_type(leaf, iref, type);
> > +
> >   	if (parent > 0) {
> >   		struct btrfs_shared_data_ref *ref;
> >   		ref = (struct btrfs_shared_data_ref *)(iref + 1);
> > diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c
> > index b93c96213304..1114cd915bd8 100644
> > --- a/fs/btrfs/print-tree.c
> > +++ b/fs/btrfs/print-tree.c
> > @@ -80,12 +80,20 @@ static void print_extent_data_ref(struct extent_buffer *eb,
> >   	       btrfs_extent_data_ref_count(eb, ref));
> >   }
> > +static void print_extent_owner_ref(struct extent_buffer *eb,
> > +				   struct btrfs_extent_owner_ref *ref)
> > +{
> > +	WARN_ON(!btrfs_fs_incompat(eb->fs_info, SIMPLE_QUOTA));
> > +	pr_cont("extent data owner root %llu\n", btrfs_extent_owner_ref_root_id(eb, ref));
> > +}
> > +
> >   static void print_extent_item(struct extent_buffer *eb, int slot, int type)
> >   {
> >   	struct btrfs_extent_item *ei;
> >   	struct btrfs_extent_inline_ref *iref;
> >   	struct btrfs_extent_data_ref *dref;
> >   	struct btrfs_shared_data_ref *sref;
> > +	struct btrfs_extent_owner_ref *oref;
> >   	struct btrfs_disk_key key;
> >   	unsigned long end;
> >   	unsigned long ptr;
> > @@ -159,6 +167,10 @@ static void print_extent_item(struct extent_buffer *eb, int slot, int type)
> >   			"\t\t\t(parent %llu not aligned to sectorsize %u)\n",
> >   				     offset, eb->fs_info->sectorsize);
> >   			break;
> > +		case BTRFS_EXTENT_OWNER_REF_KEY:
> > +			oref = (struct btrfs_extent_owner_ref *)(&iref->offset);
> > +			print_extent_owner_ref(eb, oref);
> > +			break;
> >   		default:
> >   			pr_cont("(extent %llu has INVALID ref type %d)\n",
> >   				  eb->start, type);
> > diff --git a/fs/btrfs/ref-verify.c b/fs/btrfs/ref-verify.c
> > index 95d28497de7c..9edc87eaff1f 100644
> > --- a/fs/btrfs/ref-verify.c
> > +++ b/fs/btrfs/ref-verify.c
> > @@ -485,6 +485,9 @@ static int process_extent_item(struct btrfs_fs_info *fs_info,
> >   			ret = add_shared_data_ref(fs_info, offset, count,
> >   						  key->objectid, key->offset);
> >   			break;
> > +		case BTRFS_EXTENT_OWNER_REF_KEY:
> > +			WARN_ON(!btrfs_fs_incompat(fs_info, SIMPLE_QUOTA));
> > +			break;
> >   		default:
> >   			btrfs_err(fs_info, "invalid key type in iref");
> >   			ret = -EINVAL;
> > diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> > index e2b54793bf0c..27d4230a38a8 100644
> > --- a/fs/btrfs/tree-checker.c
> > +++ b/fs/btrfs/tree-checker.c
> > @@ -1451,6 +1451,9 @@ static int check_extent_item(struct extent_buffer *leaf,
> >   			}
> >   			inline_refs += btrfs_shared_data_ref_count(leaf, sref);
> >   			break;
> > +		case BTRFS_EXTENT_OWNER_REF_KEY:
> > +			WARN_ON(!btrfs_fs_incompat(fs_info, SIMPLE_QUOTA));
> > +			break;
> >   		default:
> >   			extent_err(leaf, slot, "unknown inline ref type: %u",
> >   				   inline_type);
> > diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> > index ab38d0f411fa..424c7f342712 100644
> > --- a/include/uapi/linux/btrfs_tree.h
> > +++ b/include/uapi/linux/btrfs_tree.h
> > @@ -226,6 +226,8 @@
> >   #define BTRFS_SHARED_DATA_REF_KEY	184
> > +#define BTRFS_EXTENT_OWNER_REF_KEY	190
> > +
> >   /*
> >    * block groups give us hints into the extent allocation trees.  Which
> >    * blocks are free etc etc
> > @@ -783,6 +785,10 @@ struct btrfs_shared_data_ref {
> >   	__le32 count;
> >   } __attribute__ ((__packed__));
> > +struct btrfs_extent_owner_ref {
> > +	u64 root_id;
> > +} __attribute__ ((__packed__));
> > +
> >   struct btrfs_extent_inline_ref {
> >   	__u8 type;
> >   	__le64 offset;
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 6/9] btrfs: auto hierarchy for simple qgroups of nested subvols
  2023-05-03  4:47   ` Qu Wenruo
@ 2023-05-04 16:34     ` Boris Burkov
  2023-05-04 21:55       ` Qu Wenruo
  0 siblings, 1 reply; 23+ messages in thread
From: Boris Burkov @ 2023-05-04 16:34 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, kernel-team

On Wed, May 03, 2023 at 12:47:40PM +0800, Qu Wenruo wrote:
> 
> 
> On 2023/5/3 08:59, Boris Burkov wrote:
> > Consider the following sequence:
> > - enable quotas
> > - create subvol S id 256 at dir outer/
> > - create a qgroup 1/100
> > - add 0/256 (S's auto qgroup) to 1/100
> > - create subvol T id 257 at dir outer/inner/
> > 
> > With full qgroups, there is no relationship between 0/257 and either of
> > 0/256 or 1/100. There is an inherit feature that the creator of inner/
> > can use to specify it ought to be in 1/100.
> > 
> > Simple quotas are targeted at container isolation, where such automatic
> > inheritance for not necessarily trusted/controlled nested subvol
> > creation would be quite helpful. Therefore, add a new default behavior
> > for simple quotas: when you create a nested subvol, automatically
> > inherit as parents any parents of the qgroup of the subvol the new inode
> > is going in.
> > 
> > In our example, 257/0 would also be under 1/100, allowing easy control
> > of a total quota over an arbitrary hierarchy of subvolumes.
> > 
> > I think this _might_ be a generally useful behavior, so it could be
> > interesting to put it behind a new inheritance flag that simple quotas
> > always use while traditional quotas let the user specify, but this is a
> > minimally intrusive change to start.
> 
> I'm not sure if the automatically qgroup inherit is a good idea.
> 
> Consider the following case, a subvolume is created under another subvolume,
> then it got inherited into qgroup 1/0.

I am not sure I follow this concern. We automatically inherit
relationships of the inode containing subvolume, not its qgroup.

So if X/0 is in a different qgroup Q/1 and subvol Y is created with its
inode in X, Y/0 will be in Q/1, not X/0. I don't believe we would ever
introduce a dependency that would violate the level invariants.

> 
> But don't forget that a subvolume can be easily moved to other directory,
> should we also change which qgroup it belongs to?

I think this is a great point and I haven't spent enough time thinking
about it. Currently, the answer is no, but that might be surprising to a
user.

I imagine it gets even worse if you mount the same subvolume multiple times..

> 
> 
> Another point is, if a container manager tool wants to do the same inherit
> behavior, it's not that hard to query the owner group of the parent
> directory, then pass "-i" option for "btrfs subvolume create" or "btrfs
> subvolume snapshot" commands.
> 
> As the old saying goes, kernel should only provide a mechanism, not a
> policy. To me, automatically inherit qgroup owner looks more like a policy.

This is targeted at the case where the container management tool does
not own creating a subvol.

If we did it your way, then the second something running in a container
creates a subvol (a reasonable operation, for faster bulk deletes, e.g.),
they would trivially "escape" their disk limit. At that point you would
have to do convoluted things like LD_PRELOAD or convincing all users to
use a subvol creation util/lib (might be running some code we can't or
don't want to modify too..)

If we can do it automatically in the kernel, there really is value
there.

IMO, a useful analogy here is cgroups. When you put a process in a
cgroup, all its child processes automatically get rolled up into the
same cgroup, which is important for the same reasons as what I described
above. You can then do weird stuff and move procs to a different cgroup,
but by default everything is neatly nested, which I think is a good
behavior for simple quotas too.

If the functionality is not totally unpalatable, but it's too confusing
to make it default for only simple quotas, I can add an ioctl or subvol
creation flag or something that opts a subvol in to having its nested
subvols inherit its parent qgroups. That would be something the
container runtime could use to opt in.

Thanks for the review,
Boris

> 
> Thanks,
> Qu
> 
> > 
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >   fs/btrfs/ioctl.c       |  2 +-
> >   fs/btrfs/qgroup.c      | 46 +++++++++++++++++++++++++++++++++++++++---
> >   fs/btrfs/qgroup.h      |  6 +++---
> >   fs/btrfs/transaction.c |  2 +-
> >   4 files changed, 48 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > index ca7d2ef739c8..4d6d28feb5c6 100644
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -652,7 +652,7 @@ static noinline int create_subvol(struct mnt_idmap *idmap,
> >   	/* Tree log can't currently deal with an inode which is a new root. */
> >   	btrfs_set_log_full_commit(trans);
> > -	ret = btrfs_qgroup_inherit(trans, 0, objectid, inherit);
> > +	ret = btrfs_qgroup_inherit(trans, 0, objectid, root->root_key.objectid, inherit);
> >   	if (ret)
> >   		goto out;
> > diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> > index e3d0630fef0c..6816e01f00b5 100644
> > --- a/fs/btrfs/qgroup.c
> > +++ b/fs/btrfs/qgroup.c
> > @@ -1504,8 +1504,7 @@ static int quick_update_accounting(struct btrfs_fs_info *fs_info,
> >   	return ret;
> >   }
> > -int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src,
> > -			      u64 dst)
> > +int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src, u64 dst)
> >   {
> >   	struct btrfs_fs_info *fs_info = trans->fs_info;
> >   	struct btrfs_qgroup *parent;
> > @@ -2945,6 +2944,42 @@ int btrfs_run_qgroups(struct btrfs_trans_handle *trans)
> >   	return ret;
> >   }
> > +static int qgroup_auto_inherit(struct btrfs_fs_info *fs_info,
> > +			       u64 inode_rootid,
> > +			       struct btrfs_qgroup_inherit **inherit)
> > +{
> > +	int i = 0;
> > +	u64 num_qgroups = 0;
> > +	struct btrfs_qgroup *inode_qg;
> > +	struct btrfs_qgroup_list *qg_list;
> > +
> > +	if (*inherit)
> > +		return -EEXIST;
> > +
> > +	inode_qg = find_qgroup_rb(fs_info, inode_rootid);
> > +	if (!inode_qg)
> > +		return -ENOENT;
> > +
> > +	list_for_each_entry(qg_list, &inode_qg->groups, next_group) {
> > +		++num_qgroups;
> > +	}
> > +
> > +	if (!num_qgroups)
> > +		return 0;
> > +
> > +	*inherit = kzalloc(sizeof(**inherit) + num_qgroups * sizeof(u64), GFP_NOFS);
> > +	if (!*inherit)
> > +		return -ENOMEM;
> > +	(*inherit)->num_qgroups = num_qgroups;
> > +
> > +	list_for_each_entry(qg_list, &inode_qg->groups, next_group) {
> > +		u64 qg_id = qg_list->group->qgroupid;
> > +		*((u64 *)((*inherit)+1) + i) = qg_id;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> >   /*
> >    * Copy the accounting information between qgroups. This is necessary
> >    * when a snapshot or a subvolume is created. Throwing an error will
> > @@ -2952,7 +2987,8 @@ int btrfs_run_qgroups(struct btrfs_trans_handle *trans)
> >    * when a readonly fs is a reasonable outcome.
> >    */
> >   int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
> > -			 u64 objectid, struct btrfs_qgroup_inherit *inherit)
> > +			 u64 objectid, u64 inode_rootid,
> > +			 struct btrfs_qgroup_inherit *inherit)
> >   {
> >   	int ret = 0;
> >   	int i;
> > @@ -2994,6 +3030,9 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
> >   		goto out;
> >   	}
> > +	if (!inherit && btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
> > +		qgroup_auto_inherit(fs_info, inode_rootid, &inherit);
> > +
> >   	if (inherit) {
> >   		i_qgroups = (u64 *)(inherit + 1);
> >   		nums = inherit->num_qgroups + 2 * inherit->num_ref_copies +
> > @@ -3020,6 +3059,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
> >   	if (ret)
> >   		goto out;
> > +
> >   	/*
> >   	 * add qgroup to all inherited groups
> >   	 */
> > diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
> > index b300998dcbc7..aecebe9d0d62 100644
> > --- a/fs/btrfs/qgroup.h
> > +++ b/fs/btrfs/qgroup.h
> > @@ -272,8 +272,7 @@ int btrfs_qgroup_rescan(struct btrfs_fs_info *fs_info);
> >   void btrfs_qgroup_rescan_resume(struct btrfs_fs_info *fs_info);
> >   int btrfs_qgroup_wait_for_completion(struct btrfs_fs_info *fs_info,
> >   				     bool interruptible);
> > -int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src,
> > -			      u64 dst);
> > +int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src, u64 dst);
> >   int btrfs_del_qgroup_relation(struct btrfs_trans_handle *trans, u64 src,
> >   			      u64 dst);
> >   int btrfs_create_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid);
> > @@ -367,7 +366,8 @@ int btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans, u64 bytenr,
> >   int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans);
> >   int btrfs_run_qgroups(struct btrfs_trans_handle *trans);
> >   int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
> > -			 u64 objectid, struct btrfs_qgroup_inherit *inherit);
> > +			 u64 objectid, u64 inode_rootid,
> > +			 struct btrfs_qgroup_inherit *inherit);
> >   void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info,
> >   			       u64 ref_root, u64 num_bytes,
> >   			       enum btrfs_qgroup_rsv_type type);
> > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> > index 0edfb58afd80..6befcf1b4b1f 100644
> > --- a/fs/btrfs/transaction.c
> > +++ b/fs/btrfs/transaction.c
> > @@ -1557,7 +1557,7 @@ static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
> >   	/* Now qgroup are all updated, we can inherit it to new qgroups */
> >   	ret = btrfs_qgroup_inherit(trans, src->root_key.objectid, dst_objectid,
> > -				   inherit);
> > +				   parent->root_key.objectid, inherit);
> >   	if (ret < 0)
> >   		goto out;
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/9] btrfs: track original extent subvol in a new inline ref
  2023-05-04 16:17     ` Boris Burkov
@ 2023-05-04 21:49       ` Qu Wenruo
  2023-05-09 23:58         ` David Sterba
  0 siblings, 1 reply; 23+ messages in thread
From: Qu Wenruo @ 2023-05-04 21:49 UTC (permalink / raw)
  To: Boris Burkov, Qu Wenruo; +Cc: linux-btrfs, kernel-team



On 2023/5/5 00:17, Boris Burkov wrote:
> On Wed, May 03, 2023 at 11:17:12AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2023/5/3 08:59, Boris Burkov wrote:
>>> In order to implement simple quota groups, we need to be able to
>>> associate a data extent with the subvolume that created it. Once you
>>> account for reflink, this information cannot be recovered without
>>> explicitly storing it. Options for storing it are:
>>> - a new key/item
>>> - a new extent inline ref item
>>>
>>> The former is backwards compatible, but wastes space, the latter is
>>> incompat, but is efficient in space and reuses the existing inline ref
>>> machinery, while only abusing it a tiny amount -- specifically, the new
>>> item is not a ref, per-se.
>>
>> Even we introduce new extent tree items, we can still mark the fs compat_ro.
>>
>> As long as we don't do any writes, we can still read the fs without any
>> compatibility problem, and the enable/disable should be addressed by
>> btrfstune/mkfs anyway.
> 
> Unfortunately, I don't believe compat_ro is possible with this design.
> Because of how inline ref items are implemented, there is a lot of code
> that makes assumptions about the extent item size, and the inline ref
> item size based on their type. The best example that definitely breaks
> things rather than maybe just warning is check_extent in tree-checker.c

IIRC if it's compat_ro, older kernel would reject the block group items 
read.

If we expand that behavior to reject the whole extent tree, it can stay 
compat_ro.
Although you may need to do extra backports.

> 
> With a new unparseable ref item inserted in the sequence of refs, that
> code will either overflow or detect padding. The size calculation comes
> up 0, etc. Perhaps there is a clever way to trick it, but I have not
> seen it yet.
> 
> I was able to make it compat_ro by introducing an entirely new item for
> the owner ref, but that comes with a per extent disk usage tradeoff that
> is fairly steep for storing just a single u64.

If it's only to glue the original ref to an extent, I guess a new key 
without an item would be enough.
Although that's still quite expensive.

> 
>>
>> Thanks,
>> Qu
>>>
>>> Signed-off-by: Boris Burkov <boris@bur.io>
>>> ---
>>>    fs/btrfs/accessors.h            |  4 +++
>>>    fs/btrfs/backref.c              |  3 ++
>>>    fs/btrfs/extent-tree.c          | 50 +++++++++++++++++++++++++--------
>>>    fs/btrfs/print-tree.c           | 12 ++++++++
>>>    fs/btrfs/ref-verify.c           |  3 ++
>>>    fs/btrfs/tree-checker.c         |  3 ++
>>>    include/uapi/linux/btrfs_tree.h |  6 ++++
>>>    7 files changed, 70 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
>>> index ceadfc5d6c66..aab61312e4e8 100644
>>> --- a/fs/btrfs/accessors.h
>>> +++ b/fs/btrfs/accessors.h
>>> @@ -350,6 +350,8 @@ BTRFS_SETGET_FUNCS(extent_data_ref_count, struct btrfs_extent_data_ref, count, 3
>>>    BTRFS_SETGET_FUNCS(shared_data_ref_count, struct btrfs_shared_data_ref, count, 32);
>>> +BTRFS_SETGET_FUNCS(extent_owner_ref_root_id, struct btrfs_extent_owner_ref, root_id, 64);
>>> +
>>>    BTRFS_SETGET_FUNCS(extent_inline_ref_type, struct btrfs_extent_inline_ref,
>>>    		   type, 8);
>>>    BTRFS_SETGET_FUNCS(extent_inline_ref_offset, struct btrfs_extent_inline_ref,
>>> @@ -366,6 +368,8 @@ static inline u32 btrfs_extent_inline_ref_size(int type)
>>>    	if (type == BTRFS_EXTENT_DATA_REF_KEY)
>>>    		return sizeof(struct btrfs_extent_data_ref) +
>>>    		       offsetof(struct btrfs_extent_inline_ref, offset);
>>> +	if (type == BTRFS_EXTENT_OWNER_REF_KEY)
>>> +		return sizeof(struct btrfs_extent_inline_ref);
>>>    	return 0;
>>>    }
>>> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
>>> index e54f0884802a..8cd8ed6c572f 100644
>>> --- a/fs/btrfs/backref.c
>>> +++ b/fs/btrfs/backref.c
>>> @@ -1128,6 +1128,9 @@ static int add_inline_refs(struct btrfs_backref_walk_ctx *ctx,
>>>    						       count, sc, GFP_NOFS);
>>>    			break;
>>>    		}
>>> +		case BTRFS_EXTENT_OWNER_REF_KEY:
>>> +			WARN_ON(!btrfs_fs_incompat(ctx->fs_info, SIMPLE_QUOTA));
>>> +			break;
>>>    		default:
>>>    			WARN_ON(1);
>>>    		}
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 5cd289de4e92..b9a2f1e355b7 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -363,9 +363,13 @@ int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
>>>    				     struct btrfs_extent_inline_ref *iref,
>>>    				     enum btrfs_inline_ref_type is_data)
>>>    {
>>> +	struct btrfs_fs_info *fs_info = eb->fs_info;
>>>    	int type = btrfs_extent_inline_ref_type(eb, iref);
>>>    	u64 offset = btrfs_extent_inline_ref_offset(eb, iref);
>>> +	if (type == BTRFS_EXTENT_OWNER_REF_KEY && btrfs_fs_incompat(fs_info, SIMPLE_QUOTA))
>>> +		return type;
>>> +
>>>    	if (type == BTRFS_TREE_BLOCK_REF_KEY ||
>>>    	    type == BTRFS_SHARED_BLOCK_REF_KEY ||
>>>    	    type == BTRFS_SHARED_DATA_REF_KEY ||
>>> @@ -374,26 +378,25 @@ int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
>>>    			if (type == BTRFS_TREE_BLOCK_REF_KEY)
>>>    				return type;
>>>    			if (type == BTRFS_SHARED_BLOCK_REF_KEY) {
>>> -				ASSERT(eb->fs_info);
>>> +				ASSERT(fs_info);
>>>    				/*
>>>    				 * Every shared one has parent tree block,
>>>    				 * which must be aligned to sector size.
>>>    				 */
>>> -				if (offset &&
>>> -				    IS_ALIGNED(offset, eb->fs_info->sectorsize))
>>> +				if (offset && IS_ALIGNED(offset, fs_info->sectorsize))
>>>    					return type;
>>>    			}
>>>    		} else if (is_data == BTRFS_REF_TYPE_DATA) {
>>>    			if (type == BTRFS_EXTENT_DATA_REF_KEY)
>>>    				return type;
>>>    			if (type == BTRFS_SHARED_DATA_REF_KEY) {
>>> -				ASSERT(eb->fs_info);
>>> +				ASSERT(fs_info);
>>>    				/*
>>>    				 * Every shared one has parent tree block,
>>>    				 * which must be aligned to sector size.
>>>    				 */
>>>    				if (offset &&
>>> -				    IS_ALIGNED(offset, eb->fs_info->sectorsize))
>>> +				    IS_ALIGNED(offset, fs_info->sectorsize))
>>>    					return type;
>>>    			}
>>>    		} else {
>>> @@ -403,7 +406,7 @@ int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
>>>    	}
>>>    	btrfs_print_leaf((struct extent_buffer *)eb);
>>> -	btrfs_err(eb->fs_info,
>>> +	btrfs_err(fs_info,
>>>    		  "eb %llu iref 0x%lx invalid extent inline ref type %d",
>>>    		  eb->start, (unsigned long)iref, type);
>>>    	WARN_ON(1);
>>> @@ -912,6 +915,11 @@ int lookup_inline_extent_backref(struct btrfs_trans_handle *trans,
>>>    		}
>>>    		iref = (struct btrfs_extent_inline_ref *)ptr;
>>>    		type = btrfs_get_extent_inline_ref_type(leaf, iref, needed);
>>> +		if (type == BTRFS_EXTENT_OWNER_REF_KEY) {
>>> +			WARN_ON(!btrfs_fs_incompat(fs_info, SIMPLE_QUOTA));
>>> +			ptr += btrfs_extent_inline_ref_size(type);
>>> +			continue;
>>> +		}
>>>    		if (type == BTRFS_REF_TYPE_INVALID) {
>>>    			err = -EUCLEAN;
>>>    			goto out;
>>> @@ -1708,6 +1716,8 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
>>>    		 node->type == BTRFS_SHARED_DATA_REF_KEY)
>>>    		ret = run_delayed_data_ref(trans, node, extent_op,
>>>    					   insert_reserved);
>>> +	else if (node->type == BTRFS_EXTENT_OWNER_REF_KEY)
>>> +		ret = 0;
>>>    	else
>>>    		BUG();
>>>    	if (ret && insert_reserved)
>>> @@ -2275,6 +2285,7 @@ static noinline int check_committed_ref(struct btrfs_root *root,
>>>    	struct btrfs_extent_item *ei;
>>>    	struct btrfs_key key;
>>>    	u32 item_size;
>>> +	u32 expected_size;
>>>    	int type;
>>>    	int ret;
>>> @@ -2301,10 +2312,17 @@ static noinline int check_committed_ref(struct btrfs_root *root,
>>>    	ret = 1;
>>>    	item_size = btrfs_item_size(leaf, path->slots[0]);
>>>    	ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item);
>>> +	expected_size = sizeof(*ei) + btrfs_extent_inline_ref_size(BTRFS_EXTENT_DATA_REF_KEY);
>>> +
>>> +	iref = (struct btrfs_extent_inline_ref *)(ei + 1);
>>> +	type = btrfs_get_extent_inline_ref_type(leaf, iref, BTRFS_REF_TYPE_DATA);
>>> +	if (btrfs_fs_incompat(fs_info, SIMPLE_QUOTA) && type == BTRFS_EXTENT_OWNER_REF_KEY) {
>>> +		expected_size += btrfs_extent_inline_ref_size(BTRFS_EXTENT_OWNER_REF_KEY);
>>> +		iref = (struct btrfs_extent_inline_ref *)(iref + 1);
>>> +	}
>>>    	/* If extent item has more than 1 inline ref then it's shared */
>>> -	if (item_size != sizeof(*ei) +
>>> -	    btrfs_extent_inline_ref_size(BTRFS_EXTENT_DATA_REF_KEY))
>>> +	if (item_size != expected_size)
>>>    		goto out;
>>>    	/*
>>> @@ -2316,8 +2334,6 @@ static noinline int check_committed_ref(struct btrfs_root *root,
>>>    	     btrfs_root_last_snapshot(&root->root_item)))
>>>    		goto out;
>>> -	iref = (struct btrfs_extent_inline_ref *)(ei + 1);
>>> -
>>>    	/* If this extent has SHARED_DATA_REF then it's shared */
>>>    	type = btrfs_get_extent_inline_ref_type(leaf, iref, BTRFS_REF_TYPE_DATA);
>>>    	if (type != BTRFS_EXTENT_DATA_REF_KEY)
>>> @@ -4572,6 +4588,7 @@ static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
>>>    	struct btrfs_root *extent_root;
>>>    	int ret;
>>>    	struct btrfs_extent_item *extent_item;
>>> +	struct btrfs_extent_owner_ref *oref;
>>>    	struct btrfs_extent_inline_ref *iref;
>>>    	struct btrfs_path *path;
>>>    	struct extent_buffer *leaf;
>>> @@ -4583,7 +4600,10 @@ static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
>>>    	else
>>>    		type = BTRFS_EXTENT_DATA_REF_KEY;
>>> -	size = sizeof(*extent_item) + btrfs_extent_inline_ref_size(type);
>>> +	size = sizeof(*extent_item);
>>> +	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
>>> +		size += btrfs_extent_inline_ref_size(BTRFS_EXTENT_OWNER_REF_KEY);
>>> +	size += btrfs_extent_inline_ref_size(type);
>>>    	path = btrfs_alloc_path();
>>>    	if (!path)
>>> @@ -4604,8 +4624,16 @@ static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
>>>    	btrfs_set_extent_flags(leaf, extent_item,
>>>    			       flags | BTRFS_EXTENT_FLAG_DATA);
>>> +
>>>    	iref = (struct btrfs_extent_inline_ref *)(extent_item + 1);
>>> +	if (btrfs_fs_incompat(fs_info, SIMPLE_QUOTA)) {
>>> +		btrfs_set_extent_inline_ref_type(leaf, iref, BTRFS_EXTENT_OWNER_REF_KEY);
>>> +		oref = (struct btrfs_extent_owner_ref *)(&iref->offset);
>>> +		btrfs_set_extent_owner_ref_root_id(leaf, oref, root_objectid);
>>> +		iref = (struct btrfs_extent_inline_ref *)(oref + 1);
>>> +	}
>>>    	btrfs_set_extent_inline_ref_type(leaf, iref, type);
>>> +
>>>    	if (parent > 0) {
>>>    		struct btrfs_shared_data_ref *ref;
>>>    		ref = (struct btrfs_shared_data_ref *)(iref + 1);
>>> diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c
>>> index b93c96213304..1114cd915bd8 100644
>>> --- a/fs/btrfs/print-tree.c
>>> +++ b/fs/btrfs/print-tree.c
>>> @@ -80,12 +80,20 @@ static void print_extent_data_ref(struct extent_buffer *eb,
>>>    	       btrfs_extent_data_ref_count(eb, ref));
>>>    }
>>> +static void print_extent_owner_ref(struct extent_buffer *eb,
>>> +				   struct btrfs_extent_owner_ref *ref)
>>> +{
>>> +	WARN_ON(!btrfs_fs_incompat(eb->fs_info, SIMPLE_QUOTA));
>>> +	pr_cont("extent data owner root %llu\n", btrfs_extent_owner_ref_root_id(eb, ref));
>>> +}
>>> +
>>>    static void print_extent_item(struct extent_buffer *eb, int slot, int type)
>>>    {
>>>    	struct btrfs_extent_item *ei;
>>>    	struct btrfs_extent_inline_ref *iref;
>>>    	struct btrfs_extent_data_ref *dref;
>>>    	struct btrfs_shared_data_ref *sref;
>>> +	struct btrfs_extent_owner_ref *oref;
>>>    	struct btrfs_disk_key key;
>>>    	unsigned long end;
>>>    	unsigned long ptr;
>>> @@ -159,6 +167,10 @@ static void print_extent_item(struct extent_buffer *eb, int slot, int type)
>>>    			"\t\t\t(parent %llu not aligned to sectorsize %u)\n",
>>>    				     offset, eb->fs_info->sectorsize);
>>>    			break;
>>> +		case BTRFS_EXTENT_OWNER_REF_KEY:
>>> +			oref = (struct btrfs_extent_owner_ref *)(&iref->offset);
>>> +			print_extent_owner_ref(eb, oref);
>>> +			break;
>>>    		default:
>>>    			pr_cont("(extent %llu has INVALID ref type %d)\n",
>>>    				  eb->start, type);
>>> diff --git a/fs/btrfs/ref-verify.c b/fs/btrfs/ref-verify.c
>>> index 95d28497de7c..9edc87eaff1f 100644
>>> --- a/fs/btrfs/ref-verify.c
>>> +++ b/fs/btrfs/ref-verify.c
>>> @@ -485,6 +485,9 @@ static int process_extent_item(struct btrfs_fs_info *fs_info,
>>>    			ret = add_shared_data_ref(fs_info, offset, count,
>>>    						  key->objectid, key->offset);
>>>    			break;
>>> +		case BTRFS_EXTENT_OWNER_REF_KEY:
>>> +			WARN_ON(!btrfs_fs_incompat(fs_info, SIMPLE_QUOTA));
>>> +			break;
>>>    		default:
>>>    			btrfs_err(fs_info, "invalid key type in iref");
>>>    			ret = -EINVAL;
>>> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
>>> index e2b54793bf0c..27d4230a38a8 100644
>>> --- a/fs/btrfs/tree-checker.c
>>> +++ b/fs/btrfs/tree-checker.c
>>> @@ -1451,6 +1451,9 @@ static int check_extent_item(struct extent_buffer *leaf,
>>>    			}
>>>    			inline_refs += btrfs_shared_data_ref_count(leaf, sref);
>>>    			break;
>>> +		case BTRFS_EXTENT_OWNER_REF_KEY:
>>> +			WARN_ON(!btrfs_fs_incompat(fs_info, SIMPLE_QUOTA));
>>> +			break;
>>>    		default:
>>>    			extent_err(leaf, slot, "unknown inline ref type: %u",
>>>    				   inline_type);
>>> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
>>> index ab38d0f411fa..424c7f342712 100644
>>> --- a/include/uapi/linux/btrfs_tree.h
>>> +++ b/include/uapi/linux/btrfs_tree.h
>>> @@ -226,6 +226,8 @@
>>>    #define BTRFS_SHARED_DATA_REF_KEY	184
>>> +#define BTRFS_EXTENT_OWNER_REF_KEY	190
>>> +
>>>    /*
>>>     * block groups give us hints into the extent allocation trees.  Which
>>>     * blocks are free etc etc
>>> @@ -783,6 +785,10 @@ struct btrfs_shared_data_ref {
>>>    	__le32 count;
>>>    } __attribute__ ((__packed__));
>>> +struct btrfs_extent_owner_ref {
>>> +	u64 root_id;
>>> +} __attribute__ ((__packed__));
>>> +
>>>    struct btrfs_extent_inline_ref {
>>>    	__u8 type;
>>>    	__le64 offset;
>>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 6/9] btrfs: auto hierarchy for simple qgroups of nested subvols
  2023-05-04 16:34     ` Boris Burkov
@ 2023-05-04 21:55       ` Qu Wenruo
  0 siblings, 0 replies; 23+ messages in thread
From: Qu Wenruo @ 2023-05-04 21:55 UTC (permalink / raw)
  To: Boris Burkov, Qu Wenruo; +Cc: linux-btrfs, kernel-team



On 2023/5/5 00:34, Boris Burkov wrote:
> On Wed, May 03, 2023 at 12:47:40PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2023/5/3 08:59, Boris Burkov wrote:
>>> Consider the following sequence:
>>> - enable quotas
>>> - create subvol S id 256 at dir outer/
>>> - create a qgroup 1/100
>>> - add 0/256 (S's auto qgroup) to 1/100
>>> - create subvol T id 257 at dir outer/inner/
>>>
>>> With full qgroups, there is no relationship between 0/257 and either of
>>> 0/256 or 1/100. There is an inherit feature that the creator of inner/
>>> can use to specify it ought to be in 1/100.
>>>
>>> Simple quotas are targeted at container isolation, where such automatic
>>> inheritance for not necessarily trusted/controlled nested subvol
>>> creation would be quite helpful. Therefore, add a new default behavior
>>> for simple quotas: when you create a nested subvol, automatically
>>> inherit as parents any parents of the qgroup of the subvol the new inode
>>> is going in.
>>>
>>> In our example, 257/0 would also be under 1/100, allowing easy control
>>> of a total quota over an arbitrary hierarchy of subvolumes.
>>>
>>> I think this _might_ be a generally useful behavior, so it could be
>>> interesting to put it behind a new inheritance flag that simple quotas
>>> always use while traditional quotas let the user specify, but this is a
>>> minimally intrusive change to start.
>>
>> I'm not sure if the automatically qgroup inherit is a good idea.
>>
>> Consider the following case, a subvolume is created under another subvolume,
>> then it got inherited into qgroup 1/0.
> 
> I am not sure I follow this concern. We automatically inherit
> relationships of the inode containing subvolume, not its qgroup.
> 
> So if X/0 is in a different qgroup Q/1 and subvol Y is created with its
> inode in X, Y/0 will be in Q/1, not X/0. I don't believe we would ever
> introduce a dependency that would violate the level invariants.
> 
>>
>> But don't forget that a subvolume can be easily moved to other directory,
>> should we also change which qgroup it belongs to?
> 
> I think this is a great point and I haven't spent enough time thinking
> about it. Currently, the answer is no, but that might be surprising to a
> user.
> 
> I imagine it gets even worse if you mount the same subvolume multiple times..
> 
>>
>>
>> Another point is, if a container manager tool wants to do the same inherit
>> behavior, it's not that hard to query the owner group of the parent
>> directory, then pass "-i" option for "btrfs subvolume create" or "btrfs
>> subvolume snapshot" commands.
>>
>> As the old saying goes, kernel should only provide a mechanism, not a
>> policy. To me, automatically inherit qgroup owner looks more like a policy.
> 
> This is targeted at the case where the container management tool does
> not own creating a subvol.
> 
> If we did it your way, then the second something running in a container
> creates a subvol (a reasonable operation, for faster bulk deletes, e.g.),
> they would trivially "escape" their disk limit. At that point you would
> have to do convoluted things like LD_PRELOAD or convincing all users to
> use a subvol creation util/lib (might be running some code we can't or
> don't want to modify too..)

Oh, I totally missed such situation.

Indeed we can not limits the programs inside the container, which is 
completely unaware of the new simple quota situation.

Then the automatic inherit makes more sense, and it may also bring new 
concerns, like if some one wants to disable the automatic inherit 
behavior, but how do we avoid such disable inside the container.

I'd say when container is involved, nothing is going to be simple anymore...

Thanks,
Qu
> 
> If we can do it automatically in the kernel, there really is value
> there.
> 
> IMO, a useful analogy here is cgroups. When you put a process in a
> cgroup, all its child processes automatically get rolled up into the
> same cgroup, which is important for the same reasons as what I described
> above. You can then do weird stuff and move procs to a different cgroup,
> but by default everything is neatly nested, which I think is a good
> behavior for simple quotas too.
> 
> If the functionality is not totally unpalatable, but it's too confusing
> to make it default for only simple quotas, I can add an ioctl or subvol
> creation flag or something that opts a subvol in to having its nested
> subvols inherit its parent qgroups. That would be something the
> container runtime could use to opt in.
> 
> Thanks for the review,
> Boris
> 
>>
>> Thanks,
>> Qu
>>
>>>
>>> Signed-off-by: Boris Burkov <boris@bur.io>
>>> ---
>>>    fs/btrfs/ioctl.c       |  2 +-
>>>    fs/btrfs/qgroup.c      | 46 +++++++++++++++++++++++++++++++++++++++---
>>>    fs/btrfs/qgroup.h      |  6 +++---
>>>    fs/btrfs/transaction.c |  2 +-
>>>    4 files changed, 48 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>>> index ca7d2ef739c8..4d6d28feb5c6 100644
>>> --- a/fs/btrfs/ioctl.c
>>> +++ b/fs/btrfs/ioctl.c
>>> @@ -652,7 +652,7 @@ static noinline int create_subvol(struct mnt_idmap *idmap,
>>>    	/* Tree log can't currently deal with an inode which is a new root. */
>>>    	btrfs_set_log_full_commit(trans);
>>> -	ret = btrfs_qgroup_inherit(trans, 0, objectid, inherit);
>>> +	ret = btrfs_qgroup_inherit(trans, 0, objectid, root->root_key.objectid, inherit);
>>>    	if (ret)
>>>    		goto out;
>>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
>>> index e3d0630fef0c..6816e01f00b5 100644
>>> --- a/fs/btrfs/qgroup.c
>>> +++ b/fs/btrfs/qgroup.c
>>> @@ -1504,8 +1504,7 @@ static int quick_update_accounting(struct btrfs_fs_info *fs_info,
>>>    	return ret;
>>>    }
>>> -int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src,
>>> -			      u64 dst)
>>> +int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src, u64 dst)
>>>    {
>>>    	struct btrfs_fs_info *fs_info = trans->fs_info;
>>>    	struct btrfs_qgroup *parent;
>>> @@ -2945,6 +2944,42 @@ int btrfs_run_qgroups(struct btrfs_trans_handle *trans)
>>>    	return ret;
>>>    }
>>> +static int qgroup_auto_inherit(struct btrfs_fs_info *fs_info,
>>> +			       u64 inode_rootid,
>>> +			       struct btrfs_qgroup_inherit **inherit)
>>> +{
>>> +	int i = 0;
>>> +	u64 num_qgroups = 0;
>>> +	struct btrfs_qgroup *inode_qg;
>>> +	struct btrfs_qgroup_list *qg_list;
>>> +
>>> +	if (*inherit)
>>> +		return -EEXIST;
>>> +
>>> +	inode_qg = find_qgroup_rb(fs_info, inode_rootid);
>>> +	if (!inode_qg)
>>> +		return -ENOENT;
>>> +
>>> +	list_for_each_entry(qg_list, &inode_qg->groups, next_group) {
>>> +		++num_qgroups;
>>> +	}
>>> +
>>> +	if (!num_qgroups)
>>> +		return 0;
>>> +
>>> +	*inherit = kzalloc(sizeof(**inherit) + num_qgroups * sizeof(u64), GFP_NOFS);
>>> +	if (!*inherit)
>>> +		return -ENOMEM;
>>> +	(*inherit)->num_qgroups = num_qgroups;
>>> +
>>> +	list_for_each_entry(qg_list, &inode_qg->groups, next_group) {
>>> +		u64 qg_id = qg_list->group->qgroupid;
>>> +		*((u64 *)((*inherit)+1) + i) = qg_id;
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>>> +
>>>    /*
>>>     * Copy the accounting information between qgroups. This is necessary
>>>     * when a snapshot or a subvolume is created. Throwing an error will
>>> @@ -2952,7 +2987,8 @@ int btrfs_run_qgroups(struct btrfs_trans_handle *trans)
>>>     * when a readonly fs is a reasonable outcome.
>>>     */
>>>    int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
>>> -			 u64 objectid, struct btrfs_qgroup_inherit *inherit)
>>> +			 u64 objectid, u64 inode_rootid,
>>> +			 struct btrfs_qgroup_inherit *inherit)
>>>    {
>>>    	int ret = 0;
>>>    	int i;
>>> @@ -2994,6 +3030,9 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
>>>    		goto out;
>>>    	}
>>> +	if (!inherit && btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
>>> +		qgroup_auto_inherit(fs_info, inode_rootid, &inherit);
>>> +
>>>    	if (inherit) {
>>>    		i_qgroups = (u64 *)(inherit + 1);
>>>    		nums = inherit->num_qgroups + 2 * inherit->num_ref_copies +
>>> @@ -3020,6 +3059,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
>>>    	if (ret)
>>>    		goto out;
>>> +
>>>    	/*
>>>    	 * add qgroup to all inherited groups
>>>    	 */
>>> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
>>> index b300998dcbc7..aecebe9d0d62 100644
>>> --- a/fs/btrfs/qgroup.h
>>> +++ b/fs/btrfs/qgroup.h
>>> @@ -272,8 +272,7 @@ int btrfs_qgroup_rescan(struct btrfs_fs_info *fs_info);
>>>    void btrfs_qgroup_rescan_resume(struct btrfs_fs_info *fs_info);
>>>    int btrfs_qgroup_wait_for_completion(struct btrfs_fs_info *fs_info,
>>>    				     bool interruptible);
>>> -int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src,
>>> -			      u64 dst);
>>> +int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src, u64 dst);
>>>    int btrfs_del_qgroup_relation(struct btrfs_trans_handle *trans, u64 src,
>>>    			      u64 dst);
>>>    int btrfs_create_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid);
>>> @@ -367,7 +366,8 @@ int btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans, u64 bytenr,
>>>    int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans);
>>>    int btrfs_run_qgroups(struct btrfs_trans_handle *trans);
>>>    int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
>>> -			 u64 objectid, struct btrfs_qgroup_inherit *inherit);
>>> +			 u64 objectid, u64 inode_rootid,
>>> +			 struct btrfs_qgroup_inherit *inherit);
>>>    void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info,
>>>    			       u64 ref_root, u64 num_bytes,
>>>    			       enum btrfs_qgroup_rsv_type type);
>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>> index 0edfb58afd80..6befcf1b4b1f 100644
>>> --- a/fs/btrfs/transaction.c
>>> +++ b/fs/btrfs/transaction.c
>>> @@ -1557,7 +1557,7 @@ static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
>>>    	/* Now qgroup are all updated, we can inherit it to new qgroups */
>>>    	ret = btrfs_qgroup_inherit(trans, src->root_key.objectid, dst_objectid,
>>> -				   inherit);
>>> +				   parent->root_key.objectid, inherit);
>>>    	if (ret < 0)
>>>    		goto out;
>>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 0/9] btrfs: simple quotas
  2023-05-03  0:59 [PATCH RFC 0/9] btrfs: simple quotas Boris Burkov
                   ` (8 preceding siblings ...)
  2023-05-03  0:59 ` [PATCH 9/9] btrfs: free qgroup rsv on io failure Boris Burkov
@ 2023-05-05  4:13 ` Anand Jain
  2023-05-10  1:09   ` Boris Burkov
  9 siblings, 1 reply; 23+ messages in thread
From: Anand Jain @ 2023-05-05  4:13 UTC (permalink / raw)
  To: Boris Burkov, linux-btrfs, kernel-team

On 3/5/23 08:59, Boris Burkov wrote:
> btrfs quota groups (qgroups) are a compelling feature of btrfs that
> allow flexible control for limiting subvolume data and metadata usage.
> However, due to btrfs's high level decision to tradeoff snapshot
> performance against ref-counting performance, qgroups suffer from
> non-trivial performance issues that make them unattractive in certain
> workloads. Particularly, frequent backref walking during writes and
> during commits can make operations increasingly expensive as the number
> of snapshots scales up. For that reason, we have never been able to
> commit to using qgroups in production at Meta, despite significant
> interest from people running container workloads, where we would benefit
> from protecting the rest of the host from a buggy application in a
> container running away with disk usage.
> 
> This patch series introduces a simplified version of qgroups called
> simple quotas (squotas) which never computes global reference counts
> for extents, and thus has similar performance characteristics to normal,
> quotas disabled, btrfs. The "trick" is that in simple quotas mode, we
> account all extents permanently to the subvolume in which they were
> originally created. That allows us to make all accounting 1:1 with
> extent item lifetime, removing the need to walk backrefs. However, this
> sacrifices the ability to compute shared vs. exclusive usage. It also
> results in counter-intuitive, though still predictable and simple,
> accounting in the cases where an original extent is removed while a
> shared copy still exists. Qgroups is able to detect that case and count
> the remaining copy as an exclusive owner, while squotas is not. As a
> result, squotas works best when the original extent is immutable and
> outlives any clones.
> 
> In order to track the original creating subvolume of a data extent in
> the face of reflinks, it is necessary to add additional accounting to
> the extent item. To save space, this is done with a new inline ref item.
> However, the downside of this approach is that it makes enabling squota
> an incompat change, denoted by the new incompat bit SIMPLE_QUOTA. When
> this bit is set and quotas are enabled, new extent items get the extra
> accounting, and freed extent items check for the accounting to find
> their creating subvolume.
> 
> Squotas reuses the api of qgroups. The only difference is that when you
> enable quotas via `btrfs quota enable`, you pass the `--simple` flag.
> Squotas will always report exclusive == shared for each qgroup.
> 
> This is still a preliminary RFC patch series, so not all the ducks are
> fully in a row. In particular, some userspace parts are missing, like
> meaningful integration with fsck, which will drive further testing.
> 
> My current branches for btrfs-progs and fstests do contain some (sloppy)
> minimal support needed to run and test the feature:
> btrfs-progs: https://github.com/boryas/btrfs-progs/tree/squota-progs
> fstests: https://github.com/boryas/fstests/tree/squota-test
> 
> Current testing methodology:
> - New fstest (btrfs/400 in the squota-test branch)
> - Run all fstests with squota enabled at mkfs time. Not all tests are
>    passing in this regime, though this is actually true of qgroups as
>    well. Most of the issues have to do with leaking reserved space in
>    less commonly tested cases like I/O failures. My intent is to get this
>    test suite fully passing.
> - Run all fstests without squota enabled at mkfs time
> 
> Basic performance test:

> In this test, I ran a workload which generated K files in a subvolume,
> then took L snapshots of that subvolume, then unshared each file in
> each subvolume.

Can you pls provide a link to the test script?
I couldn't find it in the links mentioned above.

Thanks, Anand

  The measurement is just total walltime. K is the row
> index and L the column index, so in these tables, we vary between 1
> and 100 files and 1 and 10000 snapshots. The "n" table is no quotas,
> the "q" table is qgroups and the "s" table is squotas. As you can see,
> "n" and "s" are quite similar, while "q" falls of a cliff as the
> number of snapshots increases. More sophisticated and realistic
> performance testing that doesn't abuse such an insane number of
> snapshots is still to come.
> 
> n
>          1       10      100     1000    10000
> 1       0.18    0.24    1.58    16.49   211.34
> 10      0.28    0.43    2.80    29.74   324.70
> 100     0.55    0.99    6.57    65.13   717.51
> 
> q
>          1       10      100     1000    10000
> 1       2.19    0.35    2.32    25.78   756.62
> 10      0.34    0.48    3.24    68.72   3731.73
> 100     0.64    0.80    7.63    215.54  14170.73
> 
> s
>          1       10      100     1000    10000
> 1       2.19    0.32    1.83    19.19   231.75
> 10      0.31    0.43    2.86    28.86   351.42
> 100     0.70    0.90    6.75    67.89   742.93
> 
> 
> Boris Burkov (9):
>    btrfs: simple quotas mode
>    btrfs: new function for recording simple quota usage
>    btrfs: track original extent subvol in a new inline ref
>    btrfs: track metadata owning root in delayed refs
>    btrfs: record simple quota deltas
>    btrfs: auto hierarchy for simple qgroups of nested subvols
>    btrfs: check generation when recording simple quota delta
>    btrfs: expose the qgroup mode via sysfs
>    btrfs: free qgroup rsv on io failure
> 
>   fs/btrfs/accessors.h            |   6 +
>   fs/btrfs/backref.c              |   3 +
>   fs/btrfs/delayed-ref.c          |  13 +-
>   fs/btrfs/delayed-ref.h          |  28 ++++-
>   fs/btrfs/extent-tree.c          | 143 +++++++++++++++++----
>   fs/btrfs/fs.h                   |   7 +-
>   fs/btrfs/ioctl.c                |   4 +-
>   fs/btrfs/ordered-data.c         |   6 +-
>   fs/btrfs/print-tree.c           |  12 ++
>   fs/btrfs/qgroup.c               | 216 +++++++++++++++++++++++++++++---
>   fs/btrfs/qgroup.h               |  29 ++++-
>   fs/btrfs/ref-verify.c           |   3 +
>   fs/btrfs/relocation.c           |  11 +-
>   fs/btrfs/sysfs.c                |  26 ++++
>   fs/btrfs/transaction.c          |  11 +-
>   fs/btrfs/tree-checker.c         |   3 +
>   include/uapi/linux/btrfs.h      |   1 +
>   include/uapi/linux/btrfs_tree.h |  13 ++
>   18 files changed, 471 insertions(+), 64 deletions(-)
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/9] btrfs: track original extent subvol in a new inline ref
  2023-05-04 21:49       ` Qu Wenruo
@ 2023-05-09 23:58         ` David Sterba
  0 siblings, 0 replies; 23+ messages in thread
From: David Sterba @ 2023-05-09 23:58 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Boris Burkov, Qu Wenruo, linux-btrfs, kernel-team

On Fri, May 05, 2023 at 05:49:00AM +0800, Qu Wenruo wrote:
> On 2023/5/5 00:17, Boris Burkov wrote:
> > On Wed, May 03, 2023 at 11:17:12AM +0800, Qu Wenruo wrote:
> >> On 2023/5/3 08:59, Boris Burkov wrote:
> >>> In order to implement simple quota groups, we need to be able to
> >>> associate a data extent with the subvolume that created it. Once you
> >>> account for reflink, this information cannot be recovered without
> >>> explicitly storing it. Options for storing it are:
> >>> - a new key/item
> >>> - a new extent inline ref item
> >>>
> >>> The former is backwards compatible, but wastes space, the latter is
> >>> incompat, but is efficient in space and reuses the existing inline ref
> >>> machinery, while only abusing it a tiny amount -- specifically, the new
> >>> item is not a ref, per-se.
> >>
> >> Even we introduce new extent tree items, we can still mark the fs compat_ro.
> >>
> >> As long as we don't do any writes, we can still read the fs without any
> >> compatibility problem, and the enable/disable should be addressed by
> >> btrfstune/mkfs anyway.
> > 
> > Unfortunately, I don't believe compat_ro is possible with this design.
> > Because of how inline ref items are implemented, there is a lot of code
> > that makes assumptions about the extent item size, and the inline ref
> > item size based on their type. The best example that definitely breaks
> > things rather than maybe just warning is check_extent in tree-checker.c
> 
> IIRC if it's compat_ro, older kernel would reject the block group items 
> read.
> 
> If we expand that behavior to reject the whole extent tree, it can stay 
> compat_ro.
> Although you may need to do extra backports.
> 
> > 
> > With a new unparseable ref item inserted in the sequence of refs, that
> > code will either overflow or detect padding. The size calculation comes
> > up 0, etc. Perhaps there is a clever way to trick it, but I have not
> > seen it yet.
> > 
> > I was able to make it compat_ro by introducing an entirely new item for
> > the owner ref, but that comes with a per extent disk usage tradeoff that
> > is fairly steep for storing just a single u64.
> 
> If it's only to glue the original ref to an extent, I guess a new key 
> without an item would be enough.
> Although that's still quite expensive.

I consider allocating a new key as a high cost, it's worth for new
feature like verity or encryption where we require a fine grained
tracking of some new information. The number space is 255 values wide
and there are some ranges that are relatively ordered so the placement
in the logical b-tree space is good. We still have enough free values
but the gaps get smaller each time so I'd rather consider other options
first.

One drawback with features defined by keys is that it can't be easily
seen from superblock if the feature is present or not. Like extended
refs, no holes, lzo/zstd compressed extents. We always need to add the
compat bit. In case we would add a new key just to store little data and
still need to add the incompat bit it's time to think again if we could
get away with just the incompat bit. With some loss of backward
compatibility.

Right now I don't know what would be the best way forward but I'm
leaning more towards less backward compatibility and saving space in
structures. We get new incompat features "regularly" and people move to
newer kernels eventually after some period where we have time to iron
out bugs and explore the use case.

The simple quotas should fill the gap that qgroups can't so it makes
sense and people have been asking for something like that.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 0/9] btrfs: simple quotas
  2023-05-05  4:13 ` [PATCH RFC 0/9] btrfs: simple quotas Anand Jain
@ 2023-05-10  1:09   ` Boris Burkov
  0 siblings, 0 replies; 23+ messages in thread
From: Boris Burkov @ 2023-05-10  1:09 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs, kernel-team

On Fri, May 05, 2023 at 12:13:58PM +0800, Anand Jain wrote:
> On 3/5/23 08:59, Boris Burkov wrote:
> > btrfs quota groups (qgroups) are a compelling feature of btrfs that
> > allow flexible control for limiting subvolume data and metadata usage.
> > However, due to btrfs's high level decision to tradeoff snapshot
> > performance against ref-counting performance, qgroups suffer from
> > non-trivial performance issues that make them unattractive in certain
> > workloads. Particularly, frequent backref walking during writes and
> > during commits can make operations increasingly expensive as the number
> > of snapshots scales up. For that reason, we have never been able to
> > commit to using qgroups in production at Meta, despite significant
> > interest from people running container workloads, where we would benefit
> > from protecting the rest of the host from a buggy application in a
> > container running away with disk usage.
> > 
> > This patch series introduces a simplified version of qgroups called
> > simple quotas (squotas) which never computes global reference counts
> > for extents, and thus has similar performance characteristics to normal,
> > quotas disabled, btrfs. The "trick" is that in simple quotas mode, we
> > account all extents permanently to the subvolume in which they were
> > originally created. That allows us to make all accounting 1:1 with
> > extent item lifetime, removing the need to walk backrefs. However, this
> > sacrifices the ability to compute shared vs. exclusive usage. It also
> > results in counter-intuitive, though still predictable and simple,
> > accounting in the cases where an original extent is removed while a
> > shared copy still exists. Qgroups is able to detect that case and count
> > the remaining copy as an exclusive owner, while squotas is not. As a
> > result, squotas works best when the original extent is immutable and
> > outlives any clones.
> > 
> > In order to track the original creating subvolume of a data extent in
> > the face of reflinks, it is necessary to add additional accounting to
> > the extent item. To save space, this is done with a new inline ref item.
> > However, the downside of this approach is that it makes enabling squota
> > an incompat change, denoted by the new incompat bit SIMPLE_QUOTA. When
> > this bit is set and quotas are enabled, new extent items get the extra
> > accounting, and freed extent items check for the accounting to find
> > their creating subvolume.
> > 
> > Squotas reuses the api of qgroups. The only difference is that when you
> > enable quotas via `btrfs quota enable`, you pass the `--simple` flag.
> > Squotas will always report exclusive == shared for each qgroup.
> > 
> > This is still a preliminary RFC patch series, so not all the ducks are
> > fully in a row. In particular, some userspace parts are missing, like
> > meaningful integration with fsck, which will drive further testing.
> > 
> > My current branches for btrfs-progs and fstests do contain some (sloppy)
> > minimal support needed to run and test the feature:
> > btrfs-progs: https://github.com/boryas/btrfs-progs/tree/squota-progs
> > fstests: https://github.com/boryas/fstests/tree/squota-test
> > 
> > Current testing methodology:
> > - New fstest (btrfs/400 in the squota-test branch)
> > - Run all fstests with squota enabled at mkfs time. Not all tests are
> >    passing in this regime, though this is actually true of qgroups as
> >    well. Most of the issues have to do with leaking reserved space in
> >    less commonly tested cases like I/O failures. My intent is to get this
> >    test suite fully passing.
> > - Run all fstests without squota enabled at mkfs time
> > 
> > Basic performance test:
> 
> > In this test, I ran a workload which generated K files in a subvolume,
> > then took L snapshots of that subvolume, then unshared each file in
> > each subvolume.
> 
> Can you pls provide a link to the test script?
> I couldn't find it in the links mentioned above.
> 
> Thanks, Anand

Hey Anand,

Sorry I missed this message and haven't replied yet.

The reason I didn't share the scripts is that they are sort of spread
across a couple different places and I haven't collected them into a
shareable form (probably in fsperf, eventually)

The repo which has some scripts to scale snapshots, extents and files:
https://github.com/boryas/scripts/tree/main/sh/btrfs-scale
(which depends on my silly personal shell script infra to really
run... https://github.com/boryas/clitools)
and then I don't currently have the script that puts it all together,
(forgot to push and its on a diff computer but I'm at LSFMMBPF...)
but IIRC it does something like:

mkfs on an nvme; mount it
make a subvol
put K files into the subvol with the files scaling script
make a snapshot dir
make L identical snapshots in that dir (no snapshots of snapshots)
touch 4k of each of the 8k files with dd to unshare

If you'd like to play around with the workload before then, but this
isn't enough info, let me know and I'll share more with you ASAP. I
think you should be able to slap something together with this, though.

TL;DR: it's a dumb script that was just a PoC to convince me the basic
scaling was what I expected. But I promise to better share my
methodology with the next time I send the patch series.

Thanks for your interest!

Boris

> 
>  The measurement is just total walltime. K is the row
> > index and L the column index, so in these tables, we vary between 1
> > and 100 files and 1 and 10000 snapshots. The "n" table is no quotas,
> > the "q" table is qgroups and the "s" table is squotas. As you can see,
> > "n" and "s" are quite similar, while "q" falls of a cliff as the
> > number of snapshots increases. More sophisticated and realistic
> > performance testing that doesn't abuse such an insane number of
> > snapshots is still to come.
> > 
> > n
> >          1       10      100     1000    10000
> > 1       0.18    0.24    1.58    16.49   211.34
> > 10      0.28    0.43    2.80    29.74   324.70
> > 100     0.55    0.99    6.57    65.13   717.51
> > 
> > q
> >          1       10      100     1000    10000
> > 1       2.19    0.35    2.32    25.78   756.62
> > 10      0.34    0.48    3.24    68.72   3731.73
> > 100     0.64    0.80    7.63    215.54  14170.73
> > 
> > s
> >          1       10      100     1000    10000
> > 1       2.19    0.32    1.83    19.19   231.75
> > 10      0.31    0.43    2.86    28.86   351.42
> > 100     0.70    0.90    6.75    67.89   742.93
> > 
> > 
> > Boris Burkov (9):
> >    btrfs: simple quotas mode
> >    btrfs: new function for recording simple quota usage
> >    btrfs: track original extent subvol in a new inline ref
> >    btrfs: track metadata owning root in delayed refs
> >    btrfs: record simple quota deltas
> >    btrfs: auto hierarchy for simple qgroups of nested subvols
> >    btrfs: check generation when recording simple quota delta
> >    btrfs: expose the qgroup mode via sysfs
> >    btrfs: free qgroup rsv on io failure
> > 
> >   fs/btrfs/accessors.h            |   6 +
> >   fs/btrfs/backref.c              |   3 +
> >   fs/btrfs/delayed-ref.c          |  13 +-
> >   fs/btrfs/delayed-ref.h          |  28 ++++-
> >   fs/btrfs/extent-tree.c          | 143 +++++++++++++++++----
> >   fs/btrfs/fs.h                   |   7 +-
> >   fs/btrfs/ioctl.c                |   4 +-
> >   fs/btrfs/ordered-data.c         |   6 +-
> >   fs/btrfs/print-tree.c           |  12 ++
> >   fs/btrfs/qgroup.c               | 216 +++++++++++++++++++++++++++++---
> >   fs/btrfs/qgroup.h               |  29 ++++-
> >   fs/btrfs/ref-verify.c           |   3 +
> >   fs/btrfs/relocation.c           |  11 +-
> >   fs/btrfs/sysfs.c                |  26 ++++
> >   fs/btrfs/transaction.c          |  11 +-
> >   fs/btrfs/tree-checker.c         |   3 +
> >   include/uapi/linux/btrfs.h      |   1 +
> >   include/uapi/linux/btrfs_tree.h |  13 ++
> >   18 files changed, 471 insertions(+), 64 deletions(-)
> > 
> 
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/9] btrfs: track original extent subvol in a new inline ref
  2023-05-03  3:17   ` Qu Wenruo
  2023-05-04 16:17     ` Boris Burkov
@ 2023-05-10 16:57     ` David Sterba
  2023-05-11  0:07       ` Qu Wenruo
  2023-05-13  6:31       ` Qu Wenruo
  1 sibling, 2 replies; 23+ messages in thread
From: David Sterba @ 2023-05-10 16:57 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Boris Burkov, linux-btrfs, kernel-team

On Wed, May 03, 2023 at 11:17:12AM +0800, Qu Wenruo wrote:
> On 2023/5/3 08:59, Boris Burkov wrote:
> > In order to implement simple quota groups, we need to be able to
> > associate a data extent with the subvolume that created it. Once you
> > account for reflink, this information cannot be recovered without
> > explicitly storing it. Options for storing it are:
> > - a new key/item
> > - a new extent inline ref item
> > 
> > The former is backwards compatible, but wastes space, the latter is
> > incompat, but is efficient in space and reuses the existing inline ref
> > machinery, while only abusing it a tiny amount -- specifically, the new
> > item is not a ref, per-se.
> 
> Even we introduce new extent tree items, we can still mark the fs compat_ro.
> 
> As long as we don't do any writes, we can still read the fs without any 
> compatibility problem, and the enable/disable should be addressed by 
> btrfstune/mkfs anyway.

There a was a discussion today how the simple quotas should be enabled.
We have 3 ways, ioctl, mkfs and btrfstune. Currently the qgroups can be
enabled by an ioctl and newly at mkfs time.

For squotas I'd do the same, for interface parity and because the quotas
are a feature that allows that, it's an accounting layer on top of the
extent structures. Other mkfs features are once and for the whole
filesystem lifetime.

You suggest to avoid doing ioctl, which I'd understand to avoid all the
problems with races and deadlocks that we have been fixing. Fortunatelly
the quota enable ioctl is extensible so we can add the squota
enable/disable commands and built on top of the whole quota
infrastructure we already have.

In addition the mkfs enabling should work too, like for qgroups. I think
we should support the use case when the need to start accounting data
comes later than mkfs and unmounting the filesystem is not feasible.

This also follows the existing usage of the generic quotas that can be
enabled or disabled as needed.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/9] btrfs: track original extent subvol in a new inline ref
  2023-05-10 16:57     ` David Sterba
@ 2023-05-11  0:07       ` Qu Wenruo
  2023-05-13  6:31       ` Qu Wenruo
  1 sibling, 0 replies; 23+ messages in thread
From: Qu Wenruo @ 2023-05-11  0:07 UTC (permalink / raw)
  To: dsterba; +Cc: Boris Burkov, linux-btrfs, kernel-team



On 2023/5/11 00:57, David Sterba wrote:
> On Wed, May 03, 2023 at 11:17:12AM +0800, Qu Wenruo wrote:
>> On 2023/5/3 08:59, Boris Burkov wrote:
>>> In order to implement simple quota groups, we need to be able to
>>> associate a data extent with the subvolume that created it. Once you
>>> account for reflink, this information cannot be recovered without
>>> explicitly storing it. Options for storing it are:
>>> - a new key/item
>>> - a new extent inline ref item
>>>
>>> The former is backwards compatible, but wastes space, the latter is
>>> incompat, but is efficient in space and reuses the existing inline ref
>>> machinery, while only abusing it a tiny amount -- specifically, the new
>>> item is not a ref, per-se.
>>
>> Even we introduce new extent tree items, we can still mark the fs compat_ro.
>>
>> As long as we don't do any writes, we can still read the fs without any
>> compatibility problem, and the enable/disable should be addressed by
>> btrfstune/mkfs anyway.
>
> There a was a discussion today how the simple quotas should be enabled.
> We have 3 ways, ioctl, mkfs and btrfstune. Currently the qgroups can be
> enabled by an ioctl and newly at mkfs time.
>
> For squotas I'd do the same, for interface parity and because the quotas
> are a feature that allows that, it's an accounting layer on top of the
> extent structures. Other mkfs features are once and for the whole
> filesystem lifetime.

OK, ioctl is still better than mount option, so it's acceptable to me.

The other concern is, would this be compat_ro or incompat?

I want to ensure extent tree change still to be compat_ro, which may
requires us to make sure unsupported compat_ro flags would not cause any
extent tree read.

We have already skipped bg items search, but we still read the extent
tree root.

Thanks,
Qu
>
> You suggest to avoid doing ioctl, which I'd understand to avoid all the
> problems with races and deadlocks that we have been fixing. Fortunatelly
> the quota enable ioctl is extensible so we can add the squota
> enable/disable commands and built on top of the whole quota
> infrastructure we already have.
>
> In addition the mkfs enabling should work too, like for qgroups. I think
> we should support the use case when the need to start accounting data
> comes later than mkfs and unmounting the filesystem is not feasible.
>
> This also follows the existing usage of the generic quotas that can be
> enabled or disabled as needed.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/9] btrfs: track original extent subvol in a new inline ref
  2023-05-10 16:57     ` David Sterba
  2023-05-11  0:07       ` Qu Wenruo
@ 2023-05-13  6:31       ` Qu Wenruo
  1 sibling, 0 replies; 23+ messages in thread
From: Qu Wenruo @ 2023-05-13  6:31 UTC (permalink / raw)
  To: dsterba; +Cc: Boris Burkov, linux-btrfs, kernel-team



On 2023/5/11 00:57, David Sterba wrote:
> On Wed, May 03, 2023 at 11:17:12AM +0800, Qu Wenruo wrote:
>> On 2023/5/3 08:59, Boris Burkov wrote:
>>> In order to implement simple quota groups, we need to be able to
>>> associate a data extent with the subvolume that created it. Once you
>>> account for reflink, this information cannot be recovered without
>>> explicitly storing it. Options for storing it are:
>>> - a new key/item
>>> - a new extent inline ref item
>>>
>>> The former is backwards compatible, but wastes space, the latter is
>>> incompat, but is efficient in space and reuses the existing inline ref
>>> machinery, while only abusing it a tiny amount -- specifically, the new
>>> item is not a ref, per-se.
>>
>> Even we introduce new extent tree items, we can still mark the fs compat_ro.
>>
>> As long as we don't do any writes, we can still read the fs without any
>> compatibility problem, and the enable/disable should be addressed by
>> btrfstune/mkfs anyway.
>
> There a was a discussion today how the simple quotas should be enabled.
> We have 3 ways, ioctl, mkfs and btrfstune. Currently the qgroups can be
> enabled by an ioctl and newly at mkfs time.
>
> For squotas I'd do the same, for interface parity and because the quotas
> are a feature that allows that, it's an accounting layer on top of the
> extent structures. Other mkfs features are once and for the whole
> filesystem lifetime.
>
> You suggest to avoid doing ioctl, which I'd understand to avoid all the
> problems with races and deadlocks that we have been fixing. Fortunatelly
> the quota enable ioctl is extensible so we can add the squota
> enable/disable commands and built on top of the whole quota
> infrastructure we already have.
>
> In addition the mkfs enabling should work too, like for qgroups. I think
> we should support the use case when the need to start accounting data
> comes later than mkfs and unmounting the filesystem is not feasible.
>
> This also follows the existing usage of the generic quotas that can be
> enabled or disabled as needed.

BTW, if we go ioctl method, there may be more trade-off to do between
dedicated tree and inside extent tree:

- Scan progress
   If go regular extent tree, we need to update quite a large part (if
   not the whole) of the extent tree, for both enable and disable.

   If go dedicate tree, it's at least less large as the extent tree.
   For the item space inefficiency, we can pack several <bytenr, owner>
   pair into the leaf items, a little like how we handle csum items.

- Subvolume deletion (without any snapshot)
   For regular extent tree, it's less a concern, as we need to delete
   related extents anyway.

   But for dedicated tree if the subvolume is large enough, we may update
   the whole dedicate tree in just one transaction.
   (although it should still be smaller than the extent tree).

- Compatibility with extent tree v2
   For regular extent tree, it's pretty straight forward.
   But for dedicated tree, should we split the tree?

Thanks,
Qu

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2023-05-13  6:32 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-03  0:59 [PATCH RFC 0/9] btrfs: simple quotas Boris Burkov
2023-05-03  0:59 ` [PATCH 1/9] btrfs: simple quotas mode Boris Burkov
2023-05-03  2:38   ` Qu Wenruo
2023-05-03  0:59 ` [PATCH 2/9] btrfs: new function for recording simple quota usage Boris Burkov
2023-05-03  0:59 ` [PATCH 3/9] btrfs: track original extent subvol in a new inline ref Boris Burkov
2023-05-03  3:17   ` Qu Wenruo
2023-05-04 16:17     ` Boris Burkov
2023-05-04 21:49       ` Qu Wenruo
2023-05-09 23:58         ` David Sterba
2023-05-10 16:57     ` David Sterba
2023-05-11  0:07       ` Qu Wenruo
2023-05-13  6:31       ` Qu Wenruo
2023-05-03  0:59 ` [PATCH 4/9] btrfs: track metadata owning root in delayed refs Boris Burkov
2023-05-03  0:59 ` [PATCH 5/9] btrfs: record simple quota deltas Boris Burkov
2023-05-03  0:59 ` [PATCH 6/9] btrfs: auto hierarchy for simple qgroups of nested subvols Boris Burkov
2023-05-03  4:47   ` Qu Wenruo
2023-05-04 16:34     ` Boris Burkov
2023-05-04 21:55       ` Qu Wenruo
2023-05-03  0:59 ` [PATCH 7/9] btrfs: check generation when recording simple quota delta Boris Burkov
2023-05-03  0:59 ` [PATCH 8/9] btrfs: expose the qgroup mode via sysfs Boris Burkov
2023-05-03  0:59 ` [PATCH 9/9] btrfs: free qgroup rsv on io failure Boris Burkov
2023-05-05  4:13 ` [PATCH RFC 0/9] btrfs: simple quotas Anand Jain
2023-05-10  1:09   ` Boris Burkov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.