* [PATCH v5 00/18] btrfs: simple quotas
@ 2023-07-27 22:12 Boris Burkov
  2023-07-27 22:12 ` [PATCH v5 01/18] btrfs: introduce quota mode Boris Burkov
                   ` (18 more replies)
  0 siblings, 19 replies; 53+ messages in thread
From: Boris Burkov @ 2023-07-27 22:12 UTC
  To: linux-btrfs, kernel-team

btrfs quota groups (qgroups) are a compelling feature of btrfs that
allow flexible control for limiting subvolume data and metadata usage.
However, due to btrfs's high-level decision to trade off snapshot
performance against ref-counting performance, qgroups suffer from
non-trivial performance issues that make them unattractive in certain
workloads. Particularly, frequent backref walking during writes and
during commits can make operations increasingly expensive as the number
of snapshots scales up. For that reason, we have never been able to
commit to using qgroups in production at Meta, despite significant
interest from people running container workloads, where we would benefit
from protecting the rest of the host from a buggy application in a
container running away with disk usage. This patch series introduces a
simplified version of qgroups called simple quotas (squotas), which never
computes global reference counts for extents and thus has performance
characteristics similar to normal, quotas-disabled btrfs. The "trick" is
that in simple quotas mode, we
account all extents permanently to the subvolume in which they were
originally created. That allows us to make all accounting 1:1 with
extent item lifetime, removing the need to walk backrefs. However,
this sacrifices the ability to compute shared vs. exclusive usage. It
also results in counter-intuitive, though still predictable and simple
accounting in the cases where an original extent is removed while a
shared copy still exists. Qgroups is able to detect that case and count
the remaining copy as an exclusive owner, while squotas is not. As a
result, squotas works best when the original extent is immutable and
outlives any clones.
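
As a rough illustration of the accounting rule described above (a toy
model only, not code from this series; every name in it is hypothetical),
the following sketch charges an extent to its creating subvolume once and
refunds it only when the last reference is dropped:

```c
#include <stdint.h>

/* Hypothetical toy model, not kernel code. */
#define TOY_MAX_SUBVOLS 8

struct toy_extent {
	uint64_t owner;     /* subvolume that originally allocated the extent */
	uint64_t num_bytes; /* bytes charged to the owner */
	unsigned int refs;  /* references held by files in any subvolume */
};

struct toy_fs {
	uint64_t usage[TOY_MAX_SUBVOLS]; /* squota usage, indexed by subvolume id */
};

/* Allocation: charge the creating subvolume exactly once. */
void toy_alloc_extent(struct toy_fs *fs, struct toy_extent *ext,
		      uint64_t owner, uint64_t num_bytes)
{
	ext->owner = owner;
	ext->num_bytes = num_bytes;
	ext->refs = 1;
	fs->usage[owner] += num_bytes;
}

/* Reflink or snapshot: adds a reference, changes no accounting. */
void toy_share_extent(struct toy_extent *ext)
{
	ext->refs++;
}

/* Dropping a reference refunds the owner only when the extent is freed. */
void toy_drop_ref(struct toy_fs *fs, struct toy_extent *ext)
{
	if (--ext->refs == 0)
		fs->usage[ext->owner] -= ext->num_bytes;
}
```

In this model, if subvolume 1 allocates a 4096-byte extent and a snapshot
shares it, deleting the file in subvolume 1 leaves the 4096 bytes charged
to subvolume 1 until the snapshot's copy is dropped as well. That is the
counter-intuitive case mentioned above.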

==Format Change==
In order to track the original creating subvolume of a data extent in
the face of reflinks, it is necessary to add additional accounting to
the extent item. To save space, this is done with a new inline ref item.
However, the downside of this approach is that it makes enabling squota
an incompat change, denoted by the new incompat bit SIMPLE_QUOTA. When
this bit is set and quotas are enabled, new extent items get the extra
accounting, and freed extent items check for the accounting to find
their creating subvolume. In addition, 1:1 with this incompat bit,
the quota status item now tracks a "quota enablement generation" needed
for properly handling deleting extents that predate enablement.
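
To make the role of the enablement generation concrete, here is a minimal
sketch under stated assumptions: the names are hypothetical, and the exact
comparison used by the series is not shown in this letter, so ">=" below
is a guess. The idea is that an extent allocated before squotas were
enabled was never charged, so freeing it must not refund anything:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical illustration; names and the comparison are assumptions. */
struct toy_squota_state {
	uint64_t enable_gen; /* generation in which squotas were enabled */
};

/* Extents older than enablement were never charged to any subvolume. */
bool toy_extent_is_accounted(const struct toy_squota_state *st,
			     uint64_t extent_gen)
{
	return extent_gen >= st->enable_gen;
}

/* Usage delta to apply to the owning subvolume when the extent is freed. */
int64_t toy_free_delta(const struct toy_squota_state *st,
		       uint64_t extent_gen, uint64_t num_bytes)
{
	if (!toy_extent_is_accounted(st, extent_gen))
		return 0; /* predates enablement: nothing to refund */
	return -(int64_t)num_bytes;
}
```

With enable_gen set to 100, freeing a 4096-byte extent from generation 99
yields a delta of 0, while one from generation 100 yields -4096.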

==API==
Squotas reuses the API of qgroups. The only difference is that when you
enable quotas via `btrfs quota enable`, you pass the `--simple` flag.
Squotas will always report exclusive == shared for each qgroup. Squotas
deal with extent_item/metadata_item sizes and thus do not do anything
special with compression. Squotas also introduce auto inheritance for
nested subvols. The API is documented more fully in the documentation
patches in btrfs-progs.

==Testing methodology==
Using updated btrfs-progs and fstests (relevant matching patch sets to
be sent ASAP)
btrfs-progs: https://github.com/boryas/btrfs-progs/tree/squota-progs
fstests: https://github.com/boryas/fstests/tree/squota-test

I ran '-g auto' on fstests on the following configurations:
1a) baseline kernel/progs/fstests.
1b) squota kernel, baseline progs/fstests.
2a) baseline kernel/progs/fstests. fstests configured to mkfs with quota.
2b) squota kernel/progs/fstests. fstests configured to mkfs with squota.

I compared 1a against 1b and 2a against 2b and detected no regressions.
2a/2b both exhibit regressions against 1a/1b which are largely issues
with quota reservations in various complicated cases. I intend to run
those down in the future, but they are not simple quota specific, as
they are already broken with plain qgroups.

==Performance Testing==
I measured the performance of the change using fsperf. I ran with 3
configurations using the squota kernel:
- plain mkfs
- qgroup mkfs
- squota mkfs
I also added a new performance test which creates 1000 files in a subvol,
creates 100 snapshots of that subvol, then unshares extents in files in
the snapshots. I measured write performance with fio and btrfs commit
critical section performance side effects with bpftrace on
'wait_current_trans'.

The results for the test which measures unshare perf (unshare.py) with
qgroup and squota compared to the baseline:

qgroup test results
unshare results
          metric              baseline       current        stdev            diff
========================================================================================
avg_commit_ms                     162.13        285.75          3.14     76.24%
bg_count                              16            16             0      0.00%
commits                           378.20           379          1.92      0.21%
elapsed                           201.40        270.40          1.34     34.26%
end_state_mount_ns           26036211.60   26004593.60    2281065.40     -0.12%
end_state_umount_ns             2.45e+09      2.55e+09   20740154.41      3.93%
max_commit_ms                     425.80           594         53.34     39.50%
sys_cpu                             0.10          0.06          0.06    -42.15%
wait_current_trans_calls         2945.60       3405.20         47.08     15.60%
wait_current_trans_ns_max       1.56e+08      3.43e+08   32659393.25    120.07%
wait_current_trans_ns_mean    1974875.35   28588482.55    1557588.84   1347.61%
wait_current_trans_ns_min            232           232         25.88      0.00%
wait_current_trans_ns_p50            718           740         22.80      3.06%
wait_current_trans_ns_p95     7711770.20      2.21e+08   17241032.09   2761.19%
wait_current_trans_ns_p99    67744932.29      2.68e+08   41275815.87    295.16%
write_bw_bytes                 653008.80     486344.40       4209.91    -25.52%
write_clat_ns_mean            6251404.78    8406837.89      39779.15     34.48%
write_clat_ns_p50             1656422.40    1643315.20      27415.68     -0.79%
write_clat_ns_p99               1.90e+08      3.20e+08       2097152     68.62%
write_io_kbytes                   128000        128000             0      0.00%
write_iops                        159.43        118.74          1.03    -25.52%
write_lat_ns_max                7.06e+08      9.80e+08   47324816.61     38.88%
write_lat_ns_mean             6251503.06    8406936.06      39780.83     34.48%
write_lat_ns_min                    3354          4648        616.06     38.58%

squota test results
unshare results
          metric              baseline       current        stdev            diff
========================================================================================
avg_commit_ms                     162.13        164.16          3.14      1.25%
bg_count                              16             0             0   -100.00%
commits                           378.20        380.80          1.92      0.69%
elapsed                           201.40        208.20          1.34      3.38%
end_state_mount_ns           26036211.60   25840729.60    2281065.40     -0.75%
end_state_umount_ns             2.45e+09      3.01e+09   20740154.41     22.80%
max_commit_ms                     425.80        415.80         53.34     -2.35%
sys_cpu                             0.10          0.08          0.06    -23.36%
wait_current_trans_calls         2945.60       2981.60         47.08      1.22%
wait_current_trans_ns_max       1.56e+08      1.12e+08   32659393.25    -27.86%
wait_current_trans_ns_mean    1974875.35    1064734.76    1557588.84    -46.09%
wait_current_trans_ns_min            232           238         25.88      2.59%
wait_current_trans_ns_p50            718           746         22.80      3.90%
wait_current_trans_ns_p95     7711770.20       1567.60   17241032.09    -99.98%
wait_current_trans_ns_p99    67744932.29   49880514.27   41275815.87    -26.37%
write_bw_bytes                 653008.80        631256       4209.91     -3.33%
write_clat_ns_mean            6251404.78    6476816.06      39779.15      3.61%
write_clat_ns_p50             1656422.40       1581056      27415.68     -4.55%
write_clat_ns_p99               1.90e+08      1.94e+08       2097152      2.21%
write_io_kbytes                   128000        128000             0      0.00%
write_iops                        159.43        154.12          1.03     -3.33%
write_lat_ns_max                7.06e+08      7.65e+08   47324816.61      8.38%
write_lat_ns_mean             6251503.06    6476912.76      39780.83      3.61%
write_lat_ns_min                    3354          4062        616.06     21.11%

And the same, but only showing results where the deviation was outside
of a 95% confidence interval for the mean (default significance
highlighting in fsperf):
qgroup test results
unshare results
          metric              baseline       current        stdev            diff
========================================================================================
avg_commit_ms                     162.13        285.75          3.14     76.24%
elapsed                           201.40        270.40          1.34     34.26%
end_state_umount_ns             2.45e+09      2.55e+09   20740154.41      3.93%
max_commit_ms                     425.80           594         53.34     39.50%
wait_current_trans_calls         2945.60       3405.20         47.08     15.60%
wait_current_trans_ns_max       1.56e+08      3.43e+08   32659393.25    120.07%
wait_current_trans_ns_mean    1974875.35   28588482.55    1557588.84   1347.61%
wait_current_trans_ns_p95     7711770.20      2.21e+08   17241032.09   2761.19%
wait_current_trans_ns_p99    67744932.29      2.68e+08   41275815.87    295.16%
write_bw_bytes                 653008.80     486344.40       4209.91    -25.52%
write_clat_ns_mean            6251404.78    8406837.89      39779.15     34.48%
write_clat_ns_p99               1.90e+08      3.20e+08       2097152     68.62%
write_iops                        159.43        118.74          1.03    -25.52%
write_lat_ns_max                7.06e+08      9.80e+08   47324816.61     38.88%
write_lat_ns_mean             6251503.06    8406936.06      39780.83     34.48%
write_lat_ns_min                    3354          4648        616.06     38.58%

squota test results
unshare results
          metric              baseline       current        stdev            diff
========================================================================================
elapsed                           201.40        208.20          1.34      3.38%
end_state_umount_ns             2.45e+09      3.01e+09   20740154.41     22.80%
write_bw_bytes                 653008.80        631256       4209.91     -3.33%
write_clat_ns_mean            6251404.78    6476816.06      39779.15      3.61%
write_clat_ns_p50             1656422.40       1581056      27415.68     -4.55%
write_clat_ns_p99               1.90e+08      1.94e+08       2097152      2.21%
write_iops                        159.43        154.12          1.03     -3.33%
write_lat_ns_mean             6251503.06    6476912.76      39780.83      3.61%

Particularly noteworthy are the massive regressions to
wait_current_trans in qgroup mode as well as the solid regressions to
bandwidth, iops and write latency. The regressions/improvements in
squotas are modest in comparison, in line with expectations. I am
still investigating the squota umount regression, particularly whether
it is in the umount's final commit and represents a real performance
problem with squotas.
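
For readers checking the tables, the diff column appears to be the change
relative to the baseline, expressed in percent. This is an assumption
inferred from the rows above rather than taken from fsperf itself, and the
helper name is hypothetical:

```c
/* Hypothetical helper: percent change of current relative to baseline. */
double toy_percent_diff(double baseline, double current)
{
	return (current - baseline) / baseline * 100.0;
}
```

For example, the qgroup "elapsed" row (201.40 vs. 270.40) works out to
roughly 34.26, and the qgroup "write_iops" row (159.43 vs. 118.74) to
roughly -25.52, matching the reported values.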

Link: https://github.com/boryas/btrfs-progs/tree/squota-progs
Link: https://github.com/boryas/fstests/tree/squota-test
Link: https://github.com/boryas/fsperf/tree/unshare-victim

---
Changelog:
v5:
* fix btrfs/187 failure in squota mode: relocation+dedupe led to drop
  refs with the wrong owning root coming first, followed by drop refs
  with the real owning root. The "bad" ones are never last, so fix it by
  letting the "good" ones set it on the head ref.
v4:
* drop unrelated patches folded into misc-next
* fix crash where check_committed_extent was reading the inline ref type
  on an extent item with no inline refs. (btrfs/192 *without* squotas
  enabled)
v3:
* u64 -> __le64 in new owner_ref_item (as caught by kernel test bot)
v2:
* fix dumb formatting errors, unexpected/unrelated edits
* use command instead of status in ioctl
* fix the illegal GFP_KERNEL in delta fn (punted on pulling allocations
  out from the spin lock and using GFP_ATOMIC like other qgroups use
  cases for now. Plan to fix that in either v3 or a follow up series, as
  there are other places this is an issue for qgroups/squotas)
* improve boolean logic in head_ref init
* use list_count helper function instead of rolling my own bad one
* fixed the adjacent extents reloc cluster bug Josef noticed
* fixed a qgroups bug I introduced: it needs to be able to account
  extents while shutting down to not hit a warning in commit_transaction
* added a qgroup_status flag for simple quotas to not rely on the
  incompat bit directly. This allows disabling simple quotas and
  enabling qgroups.


Boris Burkov (18):
  btrfs: introduce quota mode
  btrfs: add new quota mode for simple quotas
  btrfs: expose quota mode via sysfs
  btrfs: add simple_quota incompat feature to sysfs
  btrfs: flush reservations during quota disable
  btrfs: create qgroup earlier in snapshot creation
  btrfs: function for recording simple quota deltas
  btrfs: rename tree_ref and data_ref owning_root
  btrfs: track owning root in btrfs_ref
  btrfs: track original extent owner in head_ref
  btrfs: new inline ref storing owning subvol of data extents
  btrfs: inline owner ref lookup helper
  btrfs: record simple quota deltas
  btrfs: simple quota auto hierarchy for nested subvols
  btrfs: check generation when recording simple quota delta
  btrfs: track metadata relocation cow with simple quota
  btrfs: track data relocation with simple quota
  btrfs: only set QUOTA_ENABLED when done reading qgroups

 fs/btrfs/accessors.h            |   6 +
 fs/btrfs/backref.c              |   3 +
 fs/btrfs/ctree.c                |  22 ++-
 fs/btrfs/ctree.h                |   1 +
 fs/btrfs/delayed-ref.c          |  35 ++--
 fs/btrfs/delayed-ref.h          |  32 +++-
 fs/btrfs/disk-io.c              |   5 +-
 fs/btrfs/extent-tree.c          | 242 ++++++++++++++++++++++-----
 fs/btrfs/extent-tree.h          |   6 +-
 fs/btrfs/file.c                 |  10 +-
 fs/btrfs/fs.h                   |   7 +-
 fs/btrfs/inode-item.c           |   2 +-
 fs/btrfs/ioctl.c                |   7 +-
 fs/btrfs/print-tree.c           |  12 ++
 fs/btrfs/qgroup.c               | 286 +++++++++++++++++++++++++++-----
 fs/btrfs/qgroup.h               |  28 +++-
 fs/btrfs/ref-verify.c           |   7 +-
 fs/btrfs/relocation.c           |  66 +++++++-
 fs/btrfs/root-tree.c            |   2 +-
 fs/btrfs/sysfs.c                |  28 ++++
 fs/btrfs/transaction.c          |  19 ++-
 fs/btrfs/tree-checker.c         |   3 +
 fs/btrfs/tree-log.c             |   3 +-
 include/uapi/linux/btrfs.h      |   2 +
 include/uapi/linux/btrfs_tree.h |  27 ++-
 25 files changed, 718 insertions(+), 143 deletions(-)

-- 
2.41.0



* [PATCH v5 01/18] btrfs: introduce quota mode
  2023-07-27 22:12 [PATCH v5 00/18] btrfs: simple quotas Boris Burkov
@ 2023-07-27 22:12 ` Boris Burkov
  2023-07-27 22:12 ` [PATCH v5 02/18] btrfs: add new quota mode for simple quotas Boris Burkov
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 53+ messages in thread
From: Boris Burkov @ 2023-07-27 22:12 UTC
  To: linux-btrfs, kernel-team

In preparation for introducing simple quotas, change from a binary
setting for quotas to an enum based mode. Initially, the possible modes
are disabled/full. Full quotas is normal btrfs qgroups.

Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/qgroup.c | 7 +++++++
 fs/btrfs/qgroup.h | 6 ++++++
 2 files changed, 13 insertions(+)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 2637d6b157ff..0a2085ae9bcd 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -30,6 +30,13 @@
 #include "root-tree.h"
 #include "tree-checker.h"
 
+enum btrfs_qgroup_mode btrfs_qgroup_mode(struct btrfs_fs_info *fs_info)
+{
+	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
+		return BTRFS_QGROUP_MODE_DISABLED;
+	return BTRFS_QGROUP_MODE_FULL;
+}
+
 /*
  * Helpers to access qgroup reservation
  *
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 7bffa10589d6..bb15e55f00b8 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -250,6 +250,12 @@ enum {
 };
 
 int btrfs_quota_enable(struct btrfs_fs_info *fs_info);
+enum btrfs_qgroup_mode {
+	BTRFS_QGROUP_MODE_DISABLED,
+	BTRFS_QGROUP_MODE_FULL,
+};
+
+enum btrfs_qgroup_mode btrfs_qgroup_mode(struct btrfs_fs_info *fs_info);
 int btrfs_quota_disable(struct btrfs_fs_info *fs_info);
 int btrfs_qgroup_rescan(struct btrfs_fs_info *fs_info);
 void btrfs_qgroup_rescan_resume(struct btrfs_fs_info *fs_info);
-- 
2.41.0



* [PATCH v5 02/18] btrfs: add new quota mode for simple quotas
  2023-07-27 22:12 [PATCH v5 00/18] btrfs: simple quotas Boris Burkov
  2023-07-27 22:12 ` [PATCH v5 01/18] btrfs: introduce quota mode Boris Burkov
@ 2023-07-27 22:12 ` Boris Burkov
  2023-08-21 18:00   ` Josef Bacik
  2023-09-07 11:19   ` David Sterba
  2023-07-27 22:12 ` [PATCH v5 03/18] btrfs: expose quota mode via sysfs Boris Burkov
                   ` (16 subsequent siblings)
  18 siblings, 2 replies; 53+ messages in thread
From: Boris Burkov @ 2023-07-27 22:12 UTC
  To: linux-btrfs, kernel-team

Add a new quota mode called "simple quotas". It can be enabled by the
existing quota enable ioctl via a new command, and sets an incompat
bit, as the implementation of simple quotas will make backwards
incompatible changes to the disk format of the extent tree.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/delayed-ref.c          |  4 +-
 fs/btrfs/fs.h                   |  5 +-
 fs/btrfs/ioctl.c                |  3 +-
 fs/btrfs/qgroup.c               | 91 +++++++++++++++++++++++----------
 fs/btrfs/qgroup.h               |  4 +-
 fs/btrfs/root-tree.c            |  2 +-
 fs/btrfs/transaction.c          |  4 +-
 include/uapi/linux/btrfs.h      |  2 +
 include/uapi/linux/btrfs_tree.h | 14 ++++-
 9 files changed, 91 insertions(+), 38 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 6a13cf00218b..a9b938d3a531 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -898,7 +898,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
 		return -ENOMEM;
 	}
 
-	if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_DISABLED &&
 	    !generic_ref->skip_qgroup) {
 		record = kzalloc(sizeof(*record), GFP_NOFS);
 		if (!record) {
@@ -1002,7 +1002,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
 		return -ENOMEM;
 	}
 
-	if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_DISABLED &&
 	    !generic_ref->skip_qgroup) {
 		record = kzalloc(sizeof(*record), GFP_NOFS);
 		if (!record) {
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 203d2a267828..f76f450c2abf 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -218,7 +218,8 @@ enum {
 	 BTRFS_FEATURE_INCOMPAT_NO_HOLES	|	\
 	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID	|	\
 	 BTRFS_FEATURE_INCOMPAT_RAID1C34	|	\
-	 BTRFS_FEATURE_INCOMPAT_ZONED)
+	 BTRFS_FEATURE_INCOMPAT_ZONED		|	\
+	 BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA)
 
 #ifdef CONFIG_BTRFS_DEBUG
 	/*
@@ -233,7 +234,6 @@ enum {
 
 #define BTRFS_FEATURE_INCOMPAT_SUPP		\
 	(BTRFS_FEATURE_INCOMPAT_SUPP_STABLE)
-
 #endif
 
 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET			\
@@ -790,7 +790,6 @@ struct btrfs_fs_info {
 	struct lockdep_map btrfs_state_change_map[4];
 	struct lockdep_map btrfs_trans_pending_ordered_map;
 	struct lockdep_map btrfs_ordered_extent_map;
-
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	spinlock_t ref_verify_lock;
 	struct rb_root block_tree;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a895d105464b..9b61bc62e439 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3691,7 +3691,8 @@ static long btrfs_ioctl_quota_ctl(struct file *file, void __user *arg)
 
 	switch (sa->cmd) {
 	case BTRFS_QUOTA_CTL_ENABLE:
-		ret = btrfs_quota_enable(fs_info);
+	case BTRFS_QUOTA_CTL_ENABLE_SIMPLE_QUOTA:
+		ret = btrfs_quota_enable(fs_info, sa);
 		break;
 	case BTRFS_QUOTA_CTL_DISABLE:
 		ret = btrfs_quota_disable(fs_info);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 0a2085ae9bcd..558f66994667 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -34,6 +34,8 @@ enum btrfs_qgroup_mode btrfs_qgroup_mode(struct btrfs_fs_info *fs_info)
 {
 	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
 		return BTRFS_QGROUP_MODE_DISABLED;
+	if (fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_SIMPLE)
+		return BTRFS_QGROUP_MODE_SIMPLE;
 	return BTRFS_QGROUP_MODE_FULL;
 }
 
@@ -347,6 +349,8 @@ int btrfs_verify_qgroup_counts(struct btrfs_fs_info *fs_info, u64 qgroupid,
 
 static void qgroup_mark_inconsistent(struct btrfs_fs_info *fs_info)
 {
+	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
+		return;
 	fs_info->qgroup_flags |= (BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT |
 				  BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCAN |
 				  BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING);
@@ -367,8 +371,9 @@ int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info)
 	int ret = 0;
 	u64 flags = 0;
 	u64 rescan_progress = 0;
+	bool simple;
 
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
+	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_DISABLED)
 		return 0;
 
 	fs_info->qgroup_ulist = ulist_alloc(GFP_KERNEL);
@@ -418,14 +423,14 @@ int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info)
 				 "old qgroup version, quota disabled");
 				goto out;
 			}
+			fs_info->qgroup_flags = btrfs_qgroup_status_flags(l, ptr);
+			simple = fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_SIMPLE;
 			if (btrfs_qgroup_status_generation(l, ptr) !=
-			    fs_info->generation) {
+			    fs_info->generation && !simple) {
 				qgroup_mark_inconsistent(fs_info);
 				btrfs_err(fs_info,
 					"qgroup generation mismatch, marked as inconsistent");
 			}
-			fs_info->qgroup_flags = btrfs_qgroup_status_flags(l,
-									  ptr);
 			rescan_progress = btrfs_qgroup_status_rescan(l, ptr);
 			goto next1;
 		}
@@ -557,7 +562,7 @@ bool btrfs_check_quota_leak(struct btrfs_fs_info *fs_info)
 	struct rb_node *node;
 	bool ret = false;
 
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
+	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_DISABLED)
 		return ret;
 	/*
 	 * Since we're unmounting, there is no race and no need to grab qgroup
@@ -956,7 +961,8 @@ static int btrfs_clean_quota_tree(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
-int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
+int btrfs_quota_enable(struct btrfs_fs_info *fs_info,
+		       struct btrfs_ioctl_quota_ctl_args *quota_ctl_args)
 {
 	struct btrfs_root *quota_root;
 	struct btrfs_root *tree_root = fs_info->tree_root;
@@ -968,6 +974,7 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
 	struct btrfs_qgroup *qgroup = NULL;
 	struct btrfs_trans_handle *trans = NULL;
 	struct ulist *ulist = NULL;
+	bool simple = quota_ctl_args->cmd == BTRFS_QUOTA_CTL_ENABLE_SIMPLE_QUOTA;
 	int ret = 0;
 	int slot;
 
@@ -1070,8 +1077,11 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
 				 struct btrfs_qgroup_status_item);
 	btrfs_set_qgroup_status_generation(leaf, ptr, trans->transid);
 	btrfs_set_qgroup_status_version(leaf, ptr, BTRFS_QGROUP_STATUS_VERSION);
-	fs_info->qgroup_flags = BTRFS_QGROUP_STATUS_FLAG_ON |
-				BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
+	fs_info->qgroup_flags = BTRFS_QGROUP_STATUS_FLAG_ON;
+	if (simple)
+		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_SIMPLE;
+	else
+		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
 	btrfs_set_qgroup_status_flags(leaf, ptr, fs_info->qgroup_flags &
 				      BTRFS_QGROUP_STATUS_FLAGS_MASK);
 	btrfs_set_qgroup_status_rescan(leaf, ptr, 0);
@@ -1187,8 +1197,14 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
 	spin_lock(&fs_info->qgroup_lock);
 	fs_info->quota_root = quota_root;
 	set_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);
+	if (simple)
+		btrfs_set_fs_incompat(fs_info, SIMPLE_QUOTA);
 	spin_unlock(&fs_info->qgroup_lock);
 
+	/* Skip rescan for simple qgroups */
+	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
+		goto out_free_path;
+
 	ret = qgroup_rescan_init(fs_info, 0, 1);
 	if (!ret) {
 	        qgroup_rescan_zero_tracking(fs_info);
@@ -1302,6 +1318,7 @@ int btrfs_quota_disable(struct btrfs_fs_info *fs_info)
 	quota_root = fs_info->quota_root;
 	fs_info->quota_root = NULL;
 	fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_ON;
+	fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_SIMPLE;
 	fs_info->qgroup_drop_subtree_thres = BTRFS_MAX_LEVEL;
 	spin_unlock(&fs_info->qgroup_lock);
 
@@ -1787,6 +1804,9 @@ int btrfs_qgroup_trace_extent_nolock(struct btrfs_fs_info *fs_info,
 	struct btrfs_qgroup_extent_record *entry;
 	u64 bytenr = record->bytenr;
 
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
+		return 0;
+
 	lockdep_assert_held(&delayed_refs->lock);
 	trace_btrfs_qgroup_trace_extent(fs_info, record);
 
@@ -1819,6 +1839,8 @@ int btrfs_qgroup_trace_extent_post(struct btrfs_trans_handle *trans,
 	struct btrfs_backref_walk_ctx ctx = { 0 };
 	int ret;
 
+	if (btrfs_qgroup_mode(trans->fs_info) != BTRFS_QGROUP_MODE_FULL)
+		return 0;
 	/*
 	 * We are always called in a context where we are already holding a
 	 * transaction handle. Often we are called when adding a data delayed
@@ -1874,7 +1896,7 @@ int btrfs_qgroup_trace_extent(struct btrfs_trans_handle *trans, u64 bytenr,
 	struct btrfs_delayed_ref_root *delayed_refs;
 	int ret;
 
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL
 	    || bytenr == 0 || num_bytes == 0)
 		return 0;
 	record = kzalloc(sizeof(*record), GFP_NOFS);
@@ -1907,7 +1929,7 @@ int btrfs_qgroup_trace_leaf_items(struct btrfs_trans_handle *trans,
 	u64 bytenr, num_bytes;
 
 	/* We can be called directly from walk_up_proc() */
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
 		return 0;
 
 	for (i = 0; i < nr; i++) {
@@ -2283,7 +2305,7 @@ static int qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
 	int level;
 	int ret;
 
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
 		return 0;
 
 	/* Wrong parameter order */
@@ -2340,7 +2362,7 @@ int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans,
 	BUG_ON(root_level < 0 || root_level >= BTRFS_MAX_LEVEL);
 	BUG_ON(root_eb == NULL);
 
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
 		return 0;
 
 	spin_lock(&fs_info->qgroup_lock);
@@ -2680,7 +2702,7 @@ int btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans, u64 bytenr,
 	 * If quotas get disabled meanwhile, the resources need to be freed and
 	 * we can't just exit here.
 	 */
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) ||
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL ||
 	    fs_info->qgroup_flags & BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING)
 		goto out_free;
 
@@ -2768,6 +2790,9 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans)
 	u64 qgroup_to_skip;
 	int ret = 0;
 
+	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
+		return 0;
+
 	delayed_refs = &trans->transaction->delayed_refs;
 	qgroup_to_skip = delayed_refs->qgroup_to_skip;
 	while ((node = rb_first(&delayed_refs->dirty_extent_root))) {
@@ -2883,7 +2908,7 @@ int btrfs_run_qgroups(struct btrfs_trans_handle *trans)
 			qgroup_mark_inconsistent(fs_info);
 		spin_lock(&fs_info->qgroup_lock);
 	}
-	if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_DISABLED)
 		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_ON;
 	else
 		fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_ON;
@@ -2936,7 +2961,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
 
 	if (!committing)
 		mutex_lock(&fs_info->qgroup_ioctl_lock);
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
+	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_DISABLED)
 		goto out;
 
 	quota_root = fs_info->quota_root;
@@ -3010,7 +3035,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
 		qgroup_dirty(fs_info, dstgroup);
 	}
 
-	if (srcid) {
+	if (srcid && btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_FULL) {
 		srcgroup = find_qgroup_rb(fs_info, srcid);
 		if (!srcgroup)
 			goto unlock;
@@ -3302,6 +3327,9 @@ static int qgroup_rescan_leaf(struct btrfs_trans_handle *trans,
 	int slot;
 	int ret;
 
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
+		return 1;
+
 	mutex_lock(&fs_info->qgroup_rescan_lock);
 	extent_root = btrfs_extent_root(fs_info,
 				fs_info->qgroup_rescan_progress.objectid);
@@ -3384,8 +3412,8 @@ static bool rescan_should_stop(struct btrfs_fs_info *fs_info)
 {
 	return btrfs_fs_closing(fs_info) ||
 		test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state) ||
-		!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) ||
-			  fs_info->qgroup_flags & BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCAN;
+		btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_DISABLED ||
+		fs_info->qgroup_flags & BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCAN;
 }
 
 static void btrfs_qgroup_rescan_worker(struct btrfs_work *work)
@@ -3399,6 +3427,9 @@ static void btrfs_qgroup_rescan_worker(struct btrfs_work *work)
 	bool stopped = false;
 	bool did_leaf_rescans = false;
 
+	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
+		return;
+
 	path = btrfs_alloc_path();
 	if (!path)
 		goto out;
@@ -3502,6 +3533,12 @@ qgroup_rescan_init(struct btrfs_fs_info *fs_info, u64 progress_objectid,
 {
 	int ret = 0;
 
+	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE) {
+		btrfs_warn(fs_info, "qgroup rescan init failed, running in simple mode. mode: %d\n",
+			btrfs_qgroup_mode(fs_info));
+		return -EINVAL;
+	}
+
 	if (!init_flags) {
 		/* we're resuming qgroup rescan at mount time */
 		if (!(fs_info->qgroup_flags &
@@ -3532,7 +3569,7 @@ qgroup_rescan_init(struct btrfs_fs_info *fs_info, u64 progress_objectid,
 			btrfs_warn(fs_info,
 			"qgroup rescan init failed, qgroup is not enabled");
 			ret = -EINVAL;
-		} else if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)) {
+		} else if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_DISABLED) {
 			/* Quota disable is in progress */
 			ret = -EBUSY;
 		}
@@ -3788,7 +3825,7 @@ static int qgroup_reserve_data(struct btrfs_inode *inode,
 	u64 to_reserve;
 	int ret;
 
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &root->fs_info->flags) ||
+	if (btrfs_qgroup_mode(root->fs_info) == BTRFS_QGROUP_MODE_DISABLED ||
 	    !is_fstree(root->root_key.objectid) || len == 0)
 		return 0;
 
@@ -3920,7 +3957,7 @@ static int __btrfs_qgroup_release_data(struct btrfs_inode *inode,
 	int trace_op = QGROUP_RELEASE;
 	int ret;
 
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &inode->root->fs_info->flags))
+	if (btrfs_qgroup_mode(inode->root->fs_info) == BTRFS_QGROUP_MODE_DISABLED)
 		return 0;
 
 	/* In release case, we shouldn't have @reserved */
@@ -4031,7 +4068,7 @@ int btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	int ret;
 
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) ||
+	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_DISABLED ||
 	    !is_fstree(root->root_key.objectid) || num_bytes == 0)
 		return 0;
 
@@ -4072,7 +4109,7 @@ void btrfs_qgroup_free_meta_all_pertrans(struct btrfs_root *root)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) ||
+	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_DISABLED ||
 	    !is_fstree(root->root_key.objectid))
 		return;
 
@@ -4088,7 +4125,7 @@ void __btrfs_qgroup_free_meta(struct btrfs_root *root, int num_bytes,
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) ||
+	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_DISABLED ||
 	    !is_fstree(root->root_key.objectid))
 		return;
 
@@ -4153,7 +4190,7 @@ void btrfs_qgroup_convert_reserved_meta(struct btrfs_root *root, int num_bytes)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) ||
+	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_DISABLED ||
 	    !is_fstree(root->root_key.objectid))
 		return;
 	/* Same as btrfs_qgroup_free_meta_prealloc() */
@@ -4261,7 +4298,7 @@ int btrfs_qgroup_add_swapped_blocks(struct btrfs_trans_handle *trans,
 	int level = btrfs_header_level(subvol_parent) - 1;
 	int ret = 0;
 
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
 		return 0;
 
 	if (btrfs_node_ptr_generation(subvol_parent, subvol_slot) >
@@ -4371,7 +4408,7 @@ int btrfs_qgroup_trace_subtree_after_cow(struct btrfs_trans_handle *trans,
 	int ret = 0;
 	int i;
 
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
 		return 0;
 	if (!is_fstree(root->root_key.objectid) || !root->reloc_root)
 		return 0;
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index bb15e55f00b8..d4c4d039585f 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -249,13 +249,15 @@ enum {
 	ENUM_BIT(QGROUP_FREE),
 };
 
-int btrfs_quota_enable(struct btrfs_fs_info *fs_info);
 enum btrfs_qgroup_mode {
 	BTRFS_QGROUP_MODE_DISABLED,
 	BTRFS_QGROUP_MODE_FULL,
+	BTRFS_QGROUP_MODE_SIMPLE
 };
 
 enum btrfs_qgroup_mode btrfs_qgroup_mode(struct btrfs_fs_info *fs_info);
+int btrfs_quota_enable(struct btrfs_fs_info *fs_info,
+		       struct btrfs_ioctl_quota_ctl_args *quota_ctl_args);
 int btrfs_quota_disable(struct btrfs_fs_info *fs_info);
 int btrfs_qgroup_rescan(struct btrfs_fs_info *fs_info);
 void btrfs_qgroup_rescan_resume(struct btrfs_fs_info *fs_info);
diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
index 859874579456..044a8c2710f8 100644
--- a/fs/btrfs/root-tree.c
+++ b/fs/btrfs/root-tree.c
@@ -508,7 +508,7 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
 
-	if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)) {
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_DISABLED) {
 		/* One for parent inode, two for dir entries */
 		qgroup_num_bytes = 3 * fs_info->nodesize;
 		ret = btrfs_qgroup_reserve_meta_prealloc(root,
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 815f61d6b506..89ff15aa085f 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1529,11 +1529,11 @@ static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
 	int ret;
 
 	/*
-	 * Save some performance in the case that qgroups are not
+	 * Save some performance in the case that full qgroups are not
 	 * enabled. If this check races with the ioctl, rescan will
 	 * kick in anyway.
 	 */
-	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
 		return 0;
 
 	/*
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index dbb8b96da50d..0e42f4a2121d 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -333,6 +333,7 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_INCOMPAT_RAID1C34		(1ULL << 11)
 #define BTRFS_FEATURE_INCOMPAT_ZONED		(1ULL << 12)
 #define BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2	(1ULL << 13)
+#define BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA	(1ULL << 14)
 
 struct btrfs_ioctl_feature_flags {
 	__u64 compat_flags;
@@ -753,6 +754,7 @@ struct btrfs_ioctl_get_dev_stats {
 #define BTRFS_QUOTA_CTL_ENABLE	1
 #define BTRFS_QUOTA_CTL_DISABLE	2
 #define BTRFS_QUOTA_CTL_RESCAN__NOTUSED	3
+#define BTRFS_QUOTA_CTL_ENABLE_SIMPLE_QUOTA 4
 struct btrfs_ioctl_quota_ctl_args {
 	__u64 cmd;
 	__u64 status;
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index ab38d0f411fa..47aca414a41b 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -1200,9 +1200,21 @@ static inline __u16 btrfs_qgroup_level(__u64 qgroupid)
  */
 #define BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT	(1ULL << 2)
 
+/*
+ * 3rd and 4th bits taken by non-persisted status flags in qgroup.h
+ */
+
+/*
+ * Whether or not this filesystem is using simple quotas.
+ * Not exactly the incompat bit, because we support using simple quotas,
+ * disabling it, then going back to full qgroup quotas.
+ */
+#define BTRFS_QGROUP_STATUS_FLAG_SIMPLE	(1ULL << 5)
+
 #define BTRFS_QGROUP_STATUS_FLAGS_MASK	(BTRFS_QGROUP_STATUS_FLAG_ON |		\
 					 BTRFS_QGROUP_STATUS_FLAG_RESCAN |	\
-					 BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT)
+					 BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT |	\
+					 BTRFS_QGROUP_STATUS_FLAG_SIMPLE)
 
 #define BTRFS_QGROUP_STATUS_VERSION        1
 
-- 
2.41.0


* [PATCH v5 03/18] btrfs: expose quota mode via sysfs
  2023-07-27 22:12 [PATCH v5 00/18] btrfs: simple quotas Boris Burkov
  2023-07-27 22:12 ` [PATCH v5 01/18] btrfs: introduce quota mode Boris Burkov
  2023-07-27 22:12 ` [PATCH v5 02/18] btrfs: add new quota mode for simple quotas Boris Burkov
@ 2023-07-27 22:12 ` Boris Burkov
  2023-08-21 18:00   ` Josef Bacik
  2023-09-07 11:25   ` David Sterba
  2023-07-27 22:12 ` [PATCH v5 04/18] btrfs: add simple_quota incompat feature to sysfs Boris Burkov
                   ` (15 subsequent siblings)
  18 siblings, 2 replies; 53+ messages in thread
From: Boris Burkov @ 2023-07-27 22:12 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Add a new sysfs file
/sys/fs/btrfs/<uuid>/qgroups/mode
which prints the mode qgroups is running in. The possible modes are
disabled, qgroup, and squota.
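
As an illustration of how a userspace tool might consume this file, here
is a minimal sketch that maps the three strings to an enum. The names
quota_mode, parse_quota_mode and read_quota_mode are hypothetical and
are not part of this patch or of btrfs-progs; only the file path and the
three mode strings come from the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace mirror of the kernel's enum btrfs_qgroup_mode. */
enum quota_mode {
	QUOTA_MODE_UNKNOWN = -1,
	QUOTA_MODE_DISABLED,
	QUOTA_MODE_FULL,
	QUOTA_MODE_SIMPLE,
};

/* Map the text printed by the sysfs file to a mode; a trailing newline is allowed. */
enum quota_mode parse_quota_mode(const char *text)
{
	static const char *const names[] = { "disabled", "qgroup", "squota" };
	size_t len = strcspn(text, "\n");

	for (int i = 0; i < 3; i++) {
		if (strlen(names[i]) == len && strncmp(text, names[i], len) == 0)
			return (enum quota_mode)i;
	}
	return QUOTA_MODE_UNKNOWN;
}

/* Read a file such as /sys/fs/btrfs/<uuid>/qgroups/mode and parse its first line. */
enum quota_mode read_quota_mode(const char *path)
{
	char buf[32] = "";
	FILE *f = fopen(path, "r");

	if (!f)
		return QUOTA_MODE_UNKNOWN;
	if (!fgets(buf, sizeof(buf), f))
		buf[0] = '\0';
	fclose(f);
	return parse_quota_mode(buf);
}
```

A caller would pass the path with the real filesystem UUID filled in and
treat QUOTA_MODE_UNKNOWN as "file missing or kernel too old".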

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/sysfs.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index b1d1ac25237b..e53614753391 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -2086,6 +2086,31 @@ static ssize_t qgroup_enabled_show(struct kobject *qgroups_kobj,
 }
 BTRFS_ATTR(qgroups, enabled, qgroup_enabled_show);
 
+static ssize_t qgroup_mode_show(struct kobject *qgroups_kobj,
+				struct kobj_attribute *a,
+				char *buf)
+{
+	struct btrfs_fs_info *fs_info = to_fs_info(qgroups_kobj->parent);
+	char *mode = "";
+
+	spin_lock(&fs_info->qgroup_lock);
+	switch (btrfs_qgroup_mode(fs_info)) {
+	case BTRFS_QGROUP_MODE_DISABLED:
+		mode = "disabled";
+		break;
+	case BTRFS_QGROUP_MODE_FULL:
+		mode = "qgroup";
+		break;
+	case BTRFS_QGROUP_MODE_SIMPLE:
+		mode = "squota";
+		break;
+	}
+	spin_unlock(&fs_info->qgroup_lock);
+
+	return sysfs_emit(buf, "%s\n", mode);
+}
+BTRFS_ATTR(qgroups, mode, qgroup_mode_show);
+
 static ssize_t qgroup_inconsistent_show(struct kobject *qgroups_kobj,
 					struct kobj_attribute *a,
 					char *buf)
@@ -2148,6 +2173,7 @@ static struct attribute *qgroups_attrs[] = {
 	BTRFS_ATTR_PTR(qgroups, enabled),
 	BTRFS_ATTR_PTR(qgroups, inconsistent),
 	BTRFS_ATTR_PTR(qgroups, drop_subtree_threshold),
+	BTRFS_ATTR_PTR(qgroups, mode),
 	NULL
 };
 ATTRIBUTE_GROUPS(qgroups);
-- 
2.41.0


* [PATCH v5 04/18] btrfs: add simple_quota incompat feature to sysfs
  2023-07-27 22:12 [PATCH v5 00/18] btrfs: simple quotas Boris Burkov
                   ` (2 preceding siblings ...)
  2023-07-27 22:12 ` [PATCH v5 03/18] btrfs: expose quota mode via sysfs Boris Burkov
@ 2023-07-27 22:12 ` Boris Burkov
  2023-08-21 18:01   ` Josef Bacik
  2023-09-07 11:28   ` David Sterba
  2023-07-27 22:12 ` [PATCH v5 05/18] btrfs: flush reservations during quota disable Boris Burkov
                   ` (14 subsequent siblings)
  18 siblings, 2 replies; 53+ messages in thread
From: Boris Burkov @ 2023-07-27 22:12 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Add an entry in the sysfs features directory for the new simple_quota
incompat flag.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/sysfs.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index e53614753391..f62bba0068ca 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -291,6 +291,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(metadata_uuid, METADATA_UUID);
 BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
 BTRFS_FEAT_ATTR_COMPAT_RO(block_group_tree, BLOCK_GROUP_TREE);
 BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
+BTRFS_FEAT_ATTR_INCOMPAT(simple_quota, SIMPLE_QUOTA);
 #ifdef CONFIG_BLK_DEV_ZONED
 BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
 #endif
@@ -322,6 +323,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(free_space_tree),
 	BTRFS_FEAT_ATTR_PTR(raid1c34),
 	BTRFS_FEAT_ATTR_PTR(block_group_tree),
+	BTRFS_FEAT_ATTR_PTR(simple_quota),
 #ifdef CONFIG_BLK_DEV_ZONED
 	BTRFS_FEAT_ATTR_PTR(zoned),
 #endif
-- 
2.41.0


* [PATCH v5 05/18] btrfs: flush reservations during quota disable
  2023-07-27 22:12 [PATCH v5 00/18] btrfs: simple quotas Boris Burkov
                   ` (3 preceding siblings ...)
  2023-07-27 22:12 ` [PATCH v5 04/18] btrfs: add simple_quota incompat feature to sysfs Boris Burkov
@ 2023-07-27 22:12 ` Boris Burkov
  2023-07-27 22:12 ` [PATCH v5 06/18] btrfs: create qgroup earlier in snapshot creation Boris Burkov
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 53+ messages in thread
From: Boris Burkov @ 2023-07-27 22:12 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

The following sequence:
enable simple quotas
do some writes
    reserve space
    create ordered_extent
        release rsv (store rsv_bytes in OE, mark QGROUP_RESERVED bits)
disable quotas
enable simple quotas
    set qgroup rsv to 0 on all subvols
ordered_extent finishes
    create delayed ref with rsv_bytes from before
run delayed ref
    record_simple_quota_delta
        free rsv_bytes (0 -> -rsv_delta)

results in us reliably underflowing the subvolume's qgroup rsv counter,
because disabling and re-enabling quotas resets the reservation counters
to 0 but does not remove other file system state which represents
successful acquisition of qgroup rsv space: specifically, the metadata
rsv counters on the root object, the rsv_bytes on ordered_extent objects
that have released their reservation, and the corresponding
QGROUP_RESERVED extent bits.

Normal qgroups gets away with this, I believe because it forces more
work to happen on transaction commit, but I am not certain it is totally
safe from the ordered_extent/leaked extent bit variant. Simple quotas
hits this reliably.

The intent of the fix is to make disable take the time to clear that
state, which lives outside of qgroups, as well: after flipping off the
quota bit on fs_info, flush delalloc and ordered extents, clearing the
extent bits along the way. This ensures that no ordered extents or meta
prealloc from the first enablement period are still hanging around
during the second.
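
The sequence above can be sketched as a toy model. Everything here
(toy_subvol and the toy_* functions) is hypothetical and only mimics the
counters described in this message; it is not kernel code. The counter
is signed so that the underflow is visible as a negative value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for one subvolume's quota state; not a kernel type. */
struct toy_subvol {
	bool quota_enabled;
	int64_t qgroup_rsv;    /* bytes reserved against the qgroup */
	uint64_t oe_rsv_bytes; /* rsv_bytes parked on an in-flight ordered extent */
};

void toy_enable(struct toy_subvol *sv)
{
	sv->quota_enabled = true;
	sv->qgroup_rsv = 0; /* enabling sets qgroup rsv to 0 */
}

void toy_write(struct toy_subvol *sv, uint64_t bytes)
{
	sv->qgroup_rsv += (int64_t)bytes; /* reserve space */
	sv->oe_rsv_bytes = bytes;         /* ordered extent keeps rsv_bytes */
}

/* Ordered extent finishes and its delayed ref runs. */
void toy_finish_ordered_extent(struct toy_subvol *sv)
{
	if (sv->quota_enabled)
		sv->qgroup_rsv -= (int64_t)sv->oe_rsv_bytes; /* free rsv_bytes */
	sv->oe_rsv_bytes = 0;
}

void toy_disable(struct toy_subvol *sv, bool flush)
{
	sv->quota_enabled = false;
	if (flush)
		toy_finish_ordered_extent(sv); /* the fix: drain while disabled */
}

/* Run the whole sequence and return the final reservation counter. */
int64_t toy_run(bool flush)
{
	struct toy_subvol sv = { 0 };

	toy_enable(&sv);
	toy_write(&sv, 4096);
	toy_disable(&sv, flush);
	toy_enable(&sv);
	toy_finish_ordered_extent(&sv);
	return sv.qgroup_rsv;
}
```

Without the flush, the stale rsv_bytes is freed against a counter that
was already reset, so the model ends below zero; with the flush, nothing
from the first enablement period survives into the second.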

Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/qgroup.c | 47 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 44 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 558f66994667..18f521716e8d 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1248,6 +1248,40 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
+/*
+ * It is possible to have outstanding ordered extents
+ * which reserved bytes before we disabled. We need to fully flush
+ * delalloc, ordered extents, and a commit to ensure that
+ * we don't leak such reservations, only to have them come back
+ * if we re-enable.
+ *
+ * i.e.:
+ * enable simple quotas
+ * reserve space
+ * release it, store rsv_bytes in OE
+ * disable quotas
+ * enable simple quotas (qgroup rsv are all 0)
+ * OE finishes
+ * run delayed refs
+ * free rsv_bytes, resulting in miscounting or even underflow
+ */
+static int flush_reservations(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_trans_handle *trans;
+	int ret;
+
+	ret = btrfs_start_delalloc_roots(fs_info, LONG_MAX, false);
+	if (ret)
+		return ret;
+	btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
+	trans = btrfs_join_transaction(fs_info->tree_root);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+	btrfs_commit_transaction(trans);
+
+	return ret;
+}
+
 int btrfs_quota_disable(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_root *quota_root;
@@ -1292,6 +1326,10 @@ int btrfs_quota_disable(struct btrfs_fs_info *fs_info)
 	clear_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);
 	btrfs_qgroup_wait_for_completion(fs_info, false);
 
+	ret = flush_reservations(fs_info);
+	if (ret)
+		goto out;
+
 	/*
 	 * 1 For the root item
 	 *
@@ -1353,7 +1391,7 @@ int btrfs_quota_disable(struct btrfs_fs_info *fs_info)
 	if (ret && trans)
 		btrfs_end_transaction(trans);
 	else if (trans)
-		ret = btrfs_end_transaction(trans);
+		ret = btrfs_commit_transaction(trans);
 	mutex_unlock(&fs_info->cleaner_mutex);
 
 	return ret;
@@ -3957,8 +3995,11 @@ static int __btrfs_qgroup_release_data(struct btrfs_inode *inode,
 	int trace_op = QGROUP_RELEASE;
 	int ret;
 
-	if (btrfs_qgroup_mode(inode->root->fs_info) == BTRFS_QGROUP_MODE_DISABLED)
-		return 0;
+	if (btrfs_qgroup_mode(inode->root->fs_info) == BTRFS_QGROUP_MODE_DISABLED) {
+		extent_changeset_init(&changeset);
+		return clear_record_extent_bits(&inode->io_tree, start, start + len - 1,
+				       EXTENT_QGROUP_RESERVED, &changeset);
+	}
 
 	/* In release case, we shouldn't have @reserved */
 	WARN_ON(!free && reserved);
-- 
2.41.0


* [PATCH v5 06/18] btrfs: create qgroup earlier in snapshot creation
  2023-07-27 22:12 [PATCH v5 00/18] btrfs: simple quotas Boris Burkov
                   ` (4 preceding siblings ...)
  2023-07-27 22:12 ` [PATCH v5 05/18] btrfs: flush reservations during quota disable Boris Burkov
@ 2023-07-27 22:12 ` Boris Burkov
  2023-08-21 18:02   ` Josef Bacik
  2023-09-07 11:41   ` David Sterba
  2023-07-27 22:12 ` [PATCH v5 07/18] btrfs: function for recording simple quota deltas Boris Burkov
                   ` (12 subsequent siblings)
  18 siblings, 2 replies; 53+ messages in thread
From: Boris Burkov @ 2023-07-27 22:12 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Create the qgroup earlier in snapshot creation. This allows simple
quotas qgroups to see all the metadata writes related to the snapshot
being created and to be born with the root node accounted.
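
To illustrate why the ordering matters, here is a toy model, assuming
that an accounting event for a qgroup that does not exist yet is simply
not counted (the real helper added later in this series returns -ENOENT
in that case). toy_qgroup, toy_record_delta and toy_snapshot are
hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy qgroup: either it exists and can be charged, or it does not. */
struct toy_qgroup {
	bool exists;
	uint64_t rfer;
};

/* Charge bytes to the qgroup; returns -1 when there is nothing to charge. */
int toy_record_delta(struct toy_qgroup *qg, uint64_t bytes)
{
	if (!qg->exists)
		return -1;
	qg->rfer += bytes;
	return 0;
}

/*
 * Model snapshot creation: the new root node is written once, and the
 * qgroup is created either before or after that write.
 */
uint64_t toy_snapshot(bool create_early, uint64_t nodesize)
{
	struct toy_qgroup qg = { 0 };

	if (create_early)
		qg.exists = true;
	(void)toy_record_delta(&qg, nodesize); /* root node of the snapshot */
	qg.exists = true;                      /* late creation point */
	return qg.rfer;
}
```

With early creation the qgroup starts life with the root node accounted;
with late creation the same write goes unrecorded.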

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/qgroup.c      | 3 +++
 fs/btrfs/transaction.c | 6 ++++++
 2 files changed, 9 insertions(+)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 18f521716e8d..8e3a4ced3077 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1672,6 +1672,9 @@ int btrfs_create_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid)
 	struct btrfs_qgroup *qgroup;
 	int ret = 0;
 
+	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_DISABLED)
+		return 0;
+
 	mutex_lock(&fs_info->qgroup_ioctl_lock);
 	if (!fs_info->quota_root) {
 		ret = -ENOTCONN;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 89ff15aa085f..25217888e897 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1722,6 +1722,12 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 	}
 	btrfs_release_path(path);
 
+	ret = btrfs_create_qgroup(trans, objectid);
+	if (ret) {
+		btrfs_abort_transaction(trans, ret);
+		goto fail;
+	}
+
 	/*
 	 * pull in the delayed directory update
 	 * and the delayed inode item
-- 
2.41.0


* [PATCH v5 07/18] btrfs: function for recording simple quota deltas
  2023-07-27 22:12 [PATCH v5 00/18] btrfs: simple quotas Boris Burkov
                   ` (5 preceding siblings ...)
  2023-07-27 22:12 ` [PATCH v5 06/18] btrfs: create qgroup earlier in snapshot creation Boris Burkov
@ 2023-07-27 22:12 ` Boris Burkov
  2023-08-21 18:04   ` Josef Bacik
  2023-09-07 11:46   ` David Sterba
  2023-07-27 22:12 ` [PATCH v5 08/18] btrfs: rename tree_ref and data_ref owning_root Boris Burkov
                   ` (11 subsequent siblings)
  18 siblings, 2 replies; 53+ messages in thread
From: Boris Burkov @ 2023-07-27 22:12 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Rather than re-computing shared/exclusive ownership based on backrefs
and walking roots for implicit backrefs, simple quotas does an increment
when creating an extent and a decrement when deleting it. Add the API
for the extent item code to use to track those events.

Also add a helper function to make collecting parent qgroups in a ulist
easier for functions like this.
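
The idea can be sketched in plain C as follows. This is a simplified
model under stated assumptions: a fixed-size array with pointer
comparison stands in for the kernel ulist, each qgroup has at most two
parents, and all toy_* names are hypothetical. It shows that a delta is
applied once to the qgroup and once to every ancestor, even when an
ancestor is reachable through more than one path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_PARENTS 2
#define TOY_MAX_GROUPS 16

struct toy_qgroup {
	uint64_t rfer;
	uint64_t excl;
	struct toy_qgroup *parents[TOY_MAX_PARENTS];
};

/* Collect start and all of its ancestors, each once; -1 if list is full. */
int toy_collect_parents(struct toy_qgroup *start,
			struct toy_qgroup **list, int cap)
{
	int n = 0;

	if (cap < 1)
		return -1;
	list[n++] = start;
	/* The list grows while we walk it, like iterating a ulist. */
	for (int i = 0; i < n; i++) {
		for (int p = 0; p < TOY_MAX_PARENTS; p++) {
			struct toy_qgroup *parent = list[i]->parents[p];
			bool seen = false;

			if (!parent)
				continue;
			for (int j = 0; j < n; j++)
				if (list[j] == parent)
					seen = true;
			if (seen)
				continue;
			if (n == cap)
				return -1;
			list[n++] = parent;
		}
	}
	return n;
}

/* Apply an increment (extent created) or decrement (extent deleted). */
int toy_record_delta(struct toy_qgroup *qg, uint64_t num_bytes, bool is_inc)
{
	struct toy_qgroup *list[TOY_MAX_GROUPS];
	int n = toy_collect_parents(qg, list, TOY_MAX_GROUPS);

	if (n < 0)
		return -1;
	for (int i = 0; i < n; i++) {
		if (is_inc) {
			list[i]->rfer += num_bytes;
			list[i]->excl += num_bytes;
		} else {
			list[i]->rfer -= num_bytes;
			list[i]->excl -= num_bytes;
		}
	}
	return 0;
}
```

Note that rfer and excl always move together here, which matches the
patch: simple quotas does not distinguish shared from exclusive usage.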

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/qgroup.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/qgroup.h | 11 ++++++-
 2 files changed, 83 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 8e3a4ced3077..dedc532669f4 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -332,6 +332,35 @@ static int del_relation_rb(struct btrfs_fs_info *fs_info,
 	return -ENOENT;
 }
 
+static int qgroup_collect_parents(struct btrfs_qgroup *qgroup,
+				  struct ulist *ul)
+{
+	struct ulist_iterator uiter;
+	struct ulist_node *unode;
+	struct btrfs_qgroup_list *glist;
+	struct btrfs_qgroup *qg;
+	int ret = 0;
+
+	ulist_reinit(ul);
+	ret = ulist_add(ul, qgroup->qgroupid,
+			qgroup_to_aux(qgroup), GFP_ATOMIC);
+	if (ret < 0)
+		goto out;
+	ULIST_ITER_INIT(&uiter);
+	while ((unode = ulist_next(ul, &uiter))) {
+		qg = unode_aux_to_qgroup(unode);
+		list_for_each_entry(glist, &qg->groups, next_group) {
+			ret = ulist_add(ul, glist->group->qgroupid,
+					qgroup_to_aux(glist->group), GFP_ATOMIC);
+			if (ret < 0)
+				goto out;
+		}
+	}
+	ret = 0;
+out:
+	return ret;
+}
+
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 int btrfs_verify_qgroup_counts(struct btrfs_fs_info *fs_info, u64 qgroupid,
 			       u64 rfer, u64 excl)
@@ -4535,3 +4564,47 @@ void btrfs_qgroup_destroy_extent_records(struct btrfs_transaction *trans)
 	}
 	*root = RB_ROOT;
 }
+
+int btrfs_record_simple_quota_delta(struct btrfs_fs_info *fs_info,
+				    struct btrfs_simple_quota_delta *delta)
+{
+	int ret;
+	struct ulist *ul = fs_info->qgroup_ulist;
+	struct btrfs_qgroup *qgroup;
+	struct ulist_iterator uiter;
+	struct ulist_node *unode;
+	struct btrfs_qgroup *qg;
+	u64 root = delta->root;
+	u64 num_bytes = delta->num_bytes;
+	int sign = delta->is_inc ? 1 : -1;
+
+	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_SIMPLE)
+		return 0;
+
+	if (!is_fstree(root))
+		return 0;
+
+	spin_lock(&fs_info->qgroup_lock);
+	qgroup = find_qgroup_rb(fs_info, root);
+	if (!qgroup) {
+		ret = -ENOENT;
+		goto out;
+	}
+	ret = qgroup_collect_parents(qgroup, ul);
+	if (ret)
+		goto out;
+
+	ULIST_ITER_INIT(&uiter);
+	while ((unode = ulist_next(ul, &uiter))) {
+		qg = unode_aux_to_qgroup(unode);
+		qg->excl += num_bytes * sign;
+		qg->rfer += num_bytes * sign;
+		qgroup_dirty(fs_info, qg);
+	}
+
+out:
+	spin_unlock(&fs_info->qgroup_lock);
+	if (!ret && delta->rsv_bytes)
+		btrfs_qgroup_free_refroot(fs_info, root, delta->rsv_bytes, BTRFS_QGROUP_RSV_DATA);
+	return ret;
+}
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index d4c4d039585f..94d85b4fbebd 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -235,6 +235,14 @@ struct btrfs_qgroup {
 	struct kobject kobj;
 };
 
+struct btrfs_simple_quota_delta {
+	u64 root; /* The fstree root this delta counts against */
+	u64 num_bytes; /* The number of bytes in the extent being counted */
+	u64 rsv_bytes; /* The number of bytes reserved for this extent */
+	bool is_inc; /* Whether we are using or freeing the extent */
+	bool is_data; /* Whether the extent is data or metadata */
+};
+
 static inline u64 btrfs_qgroup_subvolid(u64 qgroupid)
 {
 	return (qgroupid & ((1ULL << BTRFS_QGROUP_LEVEL_SHIFT) - 1));
@@ -447,5 +455,6 @@ int btrfs_qgroup_trace_subtree_after_cow(struct btrfs_trans_handle *trans,
 		struct btrfs_root *root, struct extent_buffer *eb);
 void btrfs_qgroup_destroy_extent_records(struct btrfs_transaction *trans);
 bool btrfs_check_quota_leak(struct btrfs_fs_info *fs_info);
-
+int btrfs_record_simple_quota_delta(struct btrfs_fs_info *fs_info,
+				    struct btrfs_simple_quota_delta *delta);
 #endif
-- 
2.41.0


* [PATCH v5 08/18] btrfs: rename tree_ref and data_ref owning_root
  2023-07-27 22:12 [PATCH v5 00/18] btrfs: simple quotas Boris Burkov
                   ` (6 preceding siblings ...)
  2023-07-27 22:12 ` [PATCH v5 07/18] btrfs: function for recording simple quota deltas Boris Burkov
@ 2023-07-27 22:12 ` Boris Burkov
  2023-07-27 22:12 ` [PATCH v5 09/18] btrfs: track owning root in btrfs_ref Boris Burkov
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 53+ messages in thread
From: Boris Burkov @ 2023-07-27 22:12 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

commit 113479d5b8eb ("btrfs: rename root fields in delayed refs structs")
changed these from ref_root to owning_root. However, there are many
circumstances where that name is not really accurate and the root on the
ref struct _is_ the referring root. In general, these are not the owning
root, though it does happen in some ref merging cases involving
overwrites during snapshots and similar.

Simple quotas cares quite a bit about tracking the original owner of an
extent through delayed refs, so rename these back to free up the name
for the real owning root (which will live on the generic btrfs_ref and
the head ref).

Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/delayed-ref.c | 10 +++++-----
 fs/btrfs/delayed-ref.h | 12 ++++++------
 fs/btrfs/extent-tree.c | 10 +++++-----
 fs/btrfs/ref-verify.c  |  4 ++--
 4 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index a9b938d3a531..f0bae1e1c455 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -885,7 +885,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
 	u64 parent = generic_ref->parent;
 	u8 ref_type;
 
-	is_system = (generic_ref->tree_ref.owning_root == BTRFS_CHUNK_TREE_OBJECTID);
+	is_system = (generic_ref->tree_ref.ref_root == BTRFS_CHUNK_TREE_OBJECTID);
 
 	ASSERT(generic_ref->type == BTRFS_REF_METADATA && generic_ref->action);
 	ref = kmem_cache_alloc(btrfs_delayed_tree_ref_cachep, GFP_NOFS);
@@ -914,14 +914,14 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
 		ref_type = BTRFS_TREE_BLOCK_REF_KEY;
 
 	init_delayed_ref_common(fs_info, &ref->node, bytenr, num_bytes,
-				generic_ref->tree_ref.owning_root, action,
+				generic_ref->tree_ref.ref_root, action,
 				ref_type);
-	ref->root = generic_ref->tree_ref.owning_root;
+	ref->root = generic_ref->tree_ref.ref_root;
 	ref->parent = parent;
 	ref->level = level;
 
 	init_delayed_ref_head(head_ref, record, bytenr, num_bytes,
-			      generic_ref->tree_ref.owning_root, 0, action,
+			      generic_ref->tree_ref.ref_root, 0, action,
 			      false, is_system);
 	head_ref->extent_op = extent_op;
 
@@ -974,7 +974,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
 	u64 bytenr = generic_ref->bytenr;
 	u64 num_bytes = generic_ref->len;
 	u64 parent = generic_ref->parent;
-	u64 ref_root = generic_ref->data_ref.owning_root;
+	u64 ref_root = generic_ref->data_ref.ref_root;
 	u64 owner = generic_ref->data_ref.ino;
 	u64 offset = generic_ref->data_ref.offset;
 	u8 ref_type;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index b8e14b0ba5f1..a71eff78469c 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -188,8 +188,8 @@ enum btrfs_ref_type {
 struct btrfs_data_ref {
 	/* For EXTENT_DATA_REF */
 
-	/* Original root this data extent belongs to */
-	u64 owning_root;
+	/* Root which owns this data reference */
+	u64 ref_root;
 
 	/* Inode which refers to this data extent */
 	u64 ino;
@@ -212,11 +212,11 @@ struct btrfs_tree_ref {
 	int level;
 
 	/*
-	 * Root which owns this tree block.
+	 * Root which owns this tree block reference.
 	 *
 	 * For TREE_BLOCK_REF (skinny metadata, either inline or keyed)
 	 */
-	u64 owning_root;
+	u64 ref_root;
 
 	/* For non-skinny metadata, no special member needed */
 };
@@ -294,7 +294,7 @@ static inline void btrfs_init_tree_ref(struct btrfs_ref *generic_ref,
 	generic_ref->real_root = mod_root ?: root;
 #endif
 	generic_ref->tree_ref.level = level;
-	generic_ref->tree_ref.owning_root = root;
+	generic_ref->tree_ref.ref_root = root;
 	generic_ref->type = BTRFS_REF_METADATA;
 	if (skip_qgroup || !(is_fstree(root) &&
 			     (!mod_root || is_fstree(mod_root))))
@@ -312,7 +312,7 @@ static inline void btrfs_init_data_ref(struct btrfs_ref *generic_ref,
 	/* If @real_root not set, use @root as fallback */
 	generic_ref->real_root = mod_root ?: ref_root;
 #endif
-	generic_ref->data_ref.owning_root = ref_root;
+	generic_ref->data_ref.ref_root = ref_root;
 	generic_ref->data_ref.ino = ino;
 	generic_ref->data_ref.offset = offset;
 	generic_ref->type = BTRFS_REF_DATA;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 04ceb9d25d3e..018e288ccf7d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1365,7 +1365,7 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 	ASSERT(generic_ref->type != BTRFS_REF_NOT_SET &&
 	       generic_ref->action);
 	BUG_ON(generic_ref->type == BTRFS_REF_METADATA &&
-	       generic_ref->tree_ref.owning_root == BTRFS_TREE_LOG_OBJECTID);
+	       generic_ref->tree_ref.ref_root == BTRFS_TREE_LOG_OBJECTID);
 
 	if (generic_ref->type == BTRFS_REF_METADATA)
 		ret = btrfs_add_delayed_tree_ref(trans, generic_ref, NULL);
@@ -3307,9 +3307,9 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_ref *ref)
 	 * tree, just update pinning info and exit early.
 	 */
 	if ((ref->type == BTRFS_REF_METADATA &&
-	     ref->tree_ref.owning_root == BTRFS_TREE_LOG_OBJECTID) ||
+	     ref->tree_ref.ref_root == BTRFS_TREE_LOG_OBJECTID) ||
 	    (ref->type == BTRFS_REF_DATA &&
-	     ref->data_ref.owning_root == BTRFS_TREE_LOG_OBJECTID)) {
+	     ref->data_ref.ref_root == BTRFS_TREE_LOG_OBJECTID)) {
 		/* unlocks the pinned mutex */
 		btrfs_pin_extent(trans, ref->bytenr, ref->len, 1);
 		ret = 0;
@@ -3320,9 +3320,9 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_ref *ref)
 	}
 
 	if (!((ref->type == BTRFS_REF_METADATA &&
-	       ref->tree_ref.owning_root == BTRFS_TREE_LOG_OBJECTID) ||
+	       ref->tree_ref.ref_root == BTRFS_TREE_LOG_OBJECTID) ||
 	      (ref->type == BTRFS_REF_DATA &&
-	       ref->data_ref.owning_root == BTRFS_TREE_LOG_OBJECTID)))
+	       ref->data_ref.ref_root == BTRFS_TREE_LOG_OBJECTID)))
 		btrfs_ref_tree_mod(fs_info, ref);
 
 	return ret;
diff --git a/fs/btrfs/ref-verify.c b/fs/btrfs/ref-verify.c
index 95d28497de7c..b7b3bd86f5e2 100644
--- a/fs/btrfs/ref-verify.c
+++ b/fs/btrfs/ref-verify.c
@@ -681,10 +681,10 @@ int btrfs_ref_tree_mod(struct btrfs_fs_info *fs_info,
 
 	if (generic_ref->type == BTRFS_REF_METADATA) {
 		if (!parent)
-			ref_root = generic_ref->tree_ref.owning_root;
+			ref_root = generic_ref->tree_ref.ref_root;
 		owner = generic_ref->tree_ref.level;
 	} else if (!parent) {
-		ref_root = generic_ref->data_ref.owning_root;
+		ref_root = generic_ref->data_ref.ref_root;
 		owner = generic_ref->data_ref.ino;
 		offset = generic_ref->data_ref.offset;
 	}
-- 
2.41.0


* [PATCH v5 09/18] btrfs: track owning root in btrfs_ref
  2023-07-27 22:12 [PATCH v5 00/18] btrfs: simple quotas Boris Burkov
                   ` (7 preceding siblings ...)
  2023-07-27 22:12 ` [PATCH v5 08/18] btrfs: rename tree_ref and data_ref owning_root Boris Burkov
@ 2023-07-27 22:12 ` Boris Burkov
  2023-08-21 18:05   ` Josef Bacik
  2023-07-27 22:12 ` [PATCH v5 10/18] btrfs: track original extent owner in head_ref Boris Burkov
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 53+ messages in thread
From: Boris Burkov @ 2023-07-27 22:12 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

While data extents require us to store additional inline refs to track
the original owner on free, this information is available implicitly for
metadata. It is found in the owner field of the header of the tree
block. Even if other trees refer to this block and the original ref goes
away, we will not rewrite that header field, so it will reliably give the
original owner.

In addition, there is a relocation case where a new data extent needs to
have an owning root separate from the referring root wired through
delayed refs.

To use it for recording simple quota deltas, we need to wire this root
id through from when we create the delayed ref until we fully process
it. Store it in the generic btrfs_ref struct of the delayed ref.
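
As a rough illustration of why the owning root has to travel with the
ref, consider this toy model. The types and functions (toy_ref,
toy_usage, toy_init_ref, toy_account) are hypothetical and much simpler
than the real btrfs_ref; root ids are small array indexes purely for
convenience.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NR_ROOTS 4

/* Toy delayed ref: who is adding/dropping the ref, and who owns the extent. */
struct toy_ref {
	uint64_t len;
	uint64_t ref_root;    /* root making or dropping this reference */
	uint64_t owning_root; /* root that originally allocated the extent */
};

/* Per-root usage, signed so that mis-accounting shows up as negative. */
struct toy_usage {
	int64_t bytes[TOY_NR_ROOTS];
};

struct toy_ref toy_init_ref(uint64_t len, uint64_t ref_root,
			    uint64_t owning_root)
{
	struct toy_ref ref = {
		.len = len,
		.ref_root = ref_root,
		.owning_root = owning_root,
	};

	return ref;
}

/* Charge or credit the owning root, never the referring root. */
void toy_account(struct toy_usage *usage, const struct toy_ref *ref, int is_inc)
{
	assert(ref->owning_root < TOY_NR_ROOTS);
	if (is_inc)
		usage->bytes[ref->owning_root] += (int64_t)ref->len;
	else
		usage->bytes[ref->owning_root] -= (int64_t)ref->len;
}
```

In this model, if root 1 allocates an extent and a snapshot (root 2)
ends up dropping the last reference, the free is still credited to
root 1. Crediting ref_root instead would drive root 2 negative while
leaving root 1 permanently charged.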

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/delayed-ref.h |  7 +++++--
 fs/btrfs/extent-tree.c | 19 +++++++++++--------
 fs/btrfs/file.c        | 10 +++++-----
 fs/btrfs/inode-item.c  |  2 +-
 fs/btrfs/relocation.c  | 17 ++++++++++-------
 fs/btrfs/tree-log.c    |  3 ++-
 6 files changed, 34 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index a71eff78469c..0729850a9193 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -239,6 +239,7 @@ struct btrfs_ref {
 #endif
 	u64 bytenr;
 	u64 len;
+	u64 owning_root;
 
 	/* Bytenr of the parent tree block */
 	u64 parent;
@@ -278,16 +279,18 @@ static inline u64 btrfs_calc_delayed_ref_bytes(const struct btrfs_fs_info *fs_in
 }
 
 static inline void btrfs_init_generic_ref(struct btrfs_ref *generic_ref,
-				int action, u64 bytenr, u64 len, u64 parent)
+				int action, u64 bytenr, u64 len, u64 parent, u64 owning_root)
 {
 	generic_ref->action = action;
 	generic_ref->bytenr = bytenr;
 	generic_ref->len = len;
 	generic_ref->parent = parent;
+	generic_ref->owning_root = owning_root;
 }
 
 static inline void btrfs_init_tree_ref(struct btrfs_ref *generic_ref,
-				int level, u64 root, u64 mod_root, bool skip_qgroup)
+				int level, u64 root, u64 mod_root,
+				bool skip_qgroup)
 {
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	/* If @real_root not set, use @root as fallback */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 018e288ccf7d..4f0115553cd3 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2389,7 +2389,7 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
 			num_bytes = btrfs_file_extent_disk_num_bytes(buf, fi);
 			key.offset -= btrfs_file_extent_offset(buf, fi);
 			btrfs_init_generic_ref(&generic_ref, action, bytenr,
-					       num_bytes, parent);
+					       num_bytes, parent, ref_root);
 			btrfs_init_data_ref(&generic_ref, ref_root, key.objectid,
 					    key.offset, root->root_key.objectid,
 					    for_reloc);
@@ -2402,8 +2402,9 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
 		} else {
 			bytenr = btrfs_node_blockptr(buf, i);
 			num_bytes = fs_info->nodesize;
+			/* We don't know the owning_root, use 0 */
 			btrfs_init_generic_ref(&generic_ref, action, bytenr,
-					       num_bytes, parent);
+					       num_bytes, parent, 0);
 			btrfs_init_tree_ref(&generic_ref, level - 1, ref_root,
 					    root->root_key.objectid, for_reloc);
 			if (inc)
@@ -3220,7 +3221,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 	int ret;
 
 	btrfs_init_generic_ref(&generic_ref, BTRFS_DROP_DELAYED_REF,
-			       buf->start, buf->len, parent);
+			       buf->start, buf->len, parent, btrfs_header_owner(buf));
 	btrfs_init_tree_ref(&generic_ref, btrfs_header_level(buf),
 			    root_id, 0, false);
 
@@ -4677,12 +4678,14 @@ int btrfs_alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
 				     struct btrfs_key *ins)
 {
 	struct btrfs_ref generic_ref = { 0 };
+	u64 root_objectid = root->root_key.objectid;
+	u64 owning_root = root_objectid;
 
-	BUG_ON(root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID);
+	BUG_ON(root_objectid == BTRFS_TREE_LOG_OBJECTID);
 
 	btrfs_init_generic_ref(&generic_ref, BTRFS_ADD_DELAYED_EXTENT,
-			       ins->objectid, ins->offset, 0);
-	btrfs_init_data_ref(&generic_ref, root->root_key.objectid, owner,
+			       ins->objectid, ins->offset, 0, owning_root);
+	btrfs_init_data_ref(&generic_ref, root_objectid, owner,
 			    offset, 0, false);
 	btrfs_ref_tree_mod(root->fs_info, &generic_ref);
 
@@ -4894,7 +4897,7 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
 		extent_op->level = level;
 
 		btrfs_init_generic_ref(&generic_ref, BTRFS_ADD_DELAYED_EXTENT,
-				       ins.objectid, ins.offset, parent);
+				       ins.objectid, ins.offset, parent, btrfs_header_owner(buf));
 		btrfs_init_tree_ref(&generic_ref, level, root_objectid,
 				    root->root_key.objectid, false);
 		btrfs_ref_tree_mod(fs_info, &generic_ref);
@@ -5315,7 +5318,7 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
 		find_next_key(path, level, &wc->drop_progress);
 
 		btrfs_init_generic_ref(&ref, BTRFS_DROP_DELAYED_REF, bytenr,
-				       fs_info->nodesize, parent);
+				       fs_info->nodesize, parent, btrfs_header_owner(next));
 		btrfs_init_tree_ref(&ref, level - 1, root->root_key.objectid,
 				    0, false);
 		ret = btrfs_free_extent(trans, &ref);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index fd03e689a6be..83b651d823bb 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -373,7 +373,7 @@ int btrfs_drop_extents(struct btrfs_trans_handle *trans,
 			if (update_refs && disk_bytenr > 0) {
 				btrfs_init_generic_ref(&ref,
 						BTRFS_ADD_DELAYED_REF,
-						disk_bytenr, num_bytes, 0);
+						disk_bytenr, num_bytes, 0, root->root_key.objectid);
 				btrfs_init_data_ref(&ref,
 						root->root_key.objectid,
 						new_key.objectid,
@@ -463,7 +463,7 @@ int btrfs_drop_extents(struct btrfs_trans_handle *trans,
 			} else if (update_refs && disk_bytenr > 0) {
 				btrfs_init_generic_ref(&ref,
 						BTRFS_DROP_DELAYED_REF,
-						disk_bytenr, num_bytes, 0);
+						disk_bytenr, num_bytes, 0, root->root_key.objectid);
 				btrfs_init_data_ref(&ref,
 						root->root_key.objectid,
 						key.objectid,
@@ -745,7 +745,7 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
 		btrfs_mark_buffer_dirty(leaf);
 
 		btrfs_init_generic_ref(&ref, BTRFS_ADD_DELAYED_REF, bytenr,
-				       num_bytes, 0);
+				       num_bytes, 0, root->root_key.objectid);
 		btrfs_init_data_ref(&ref, root->root_key.objectid, ino,
 				    orig_offset, 0, false);
 		ret = btrfs_inc_extent_ref(trans, &ref);
@@ -771,7 +771,7 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
 	other_start = end;
 	other_end = 0;
 	btrfs_init_generic_ref(&ref, BTRFS_DROP_DELAYED_REF, bytenr,
-			       num_bytes, 0);
+			       num_bytes, 0, root->root_key.objectid);
 	btrfs_init_data_ref(&ref, root->root_key.objectid, ino, orig_offset,
 			    0, false);
 	if (extent_mergeable(leaf, path->slots[0] + 1,
@@ -2290,7 +2290,7 @@ static int btrfs_insert_replace_extent(struct btrfs_trans_handle *trans,
 
 		btrfs_init_generic_ref(&ref, BTRFS_ADD_DELAYED_REF,
 				       extent_info->disk_offset,
-				       extent_info->disk_len, 0);
+				       extent_info->disk_len, 0, root->root_key.objectid);
 		ref_offset = extent_info->file_offset - extent_info->data_offset;
 		btrfs_init_data_ref(&ref, root->root_key.objectid,
 				    btrfs_ino(inode), ref_offset, 0, false);
diff --git a/fs/btrfs/inode-item.c b/fs/btrfs/inode-item.c
index 4c322b720a80..4a56bf679de6 100644
--- a/fs/btrfs/inode-item.c
+++ b/fs/btrfs/inode-item.c
@@ -676,7 +676,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
 			bytes_deleted += extent_num_bytes;
 
 			btrfs_init_generic_ref(&ref, BTRFS_DROP_DELAYED_REF,
-					extent_start, extent_num_bytes, 0);
+					extent_start, extent_num_bytes, 0, root->root_key.objectid);
 			btrfs_init_data_ref(&ref, btrfs_header_owner(leaf),
 					control->ino, extent_offset,
 					root->root_key.objectid, false);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 9db2e6fa2cb2..3161a48d5970 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1158,7 +1158,7 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
 
 		key.offset -= btrfs_file_extent_offset(leaf, fi);
 		btrfs_init_generic_ref(&ref, BTRFS_ADD_DELAYED_REF, new_bytenr,
-				       num_bytes, parent);
+				       num_bytes, parent, root->root_key.objectid);
 		btrfs_init_data_ref(&ref, btrfs_header_owner(leaf),
 				    key.objectid, key.offset,
 				    root->root_key.objectid, false);
@@ -1169,7 +1169,7 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
 		}
 
 		btrfs_init_generic_ref(&ref, BTRFS_DROP_DELAYED_REF, bytenr,
-				       num_bytes, parent);
+				       num_bytes, parent, root->root_key.objectid);
 		btrfs_init_data_ref(&ref, btrfs_header_owner(leaf),
 				    key.objectid, key.offset,
 				    root->root_key.objectid, false);
@@ -1382,7 +1382,8 @@ int replace_path(struct btrfs_trans_handle *trans, struct reloc_control *rc,
 		btrfs_mark_buffer_dirty(path->nodes[level]);
 
 		btrfs_init_generic_ref(&ref, BTRFS_ADD_DELAYED_REF, old_bytenr,
-				       blocksize, path->nodes[level]->start);
+				       blocksize, path->nodes[level]->start,
+				       src->root_key.objectid);
 		btrfs_init_tree_ref(&ref, level - 1, src->root_key.objectid,
 				    0, true);
 		ret = btrfs_inc_extent_ref(trans, &ref);
@@ -1391,7 +1392,7 @@ int replace_path(struct btrfs_trans_handle *trans, struct reloc_control *rc,
 			break;
 		}
 		btrfs_init_generic_ref(&ref, BTRFS_ADD_DELAYED_REF, new_bytenr,
-				       blocksize, 0);
+				       blocksize, 0, dest->root_key.objectid);
 		btrfs_init_tree_ref(&ref, level - 1, dest->root_key.objectid, 0,
 				    true);
 		ret = btrfs_inc_extent_ref(trans, &ref);
@@ -1400,8 +1401,9 @@ int replace_path(struct btrfs_trans_handle *trans, struct reloc_control *rc,
 			break;
 		}
 
+		/* We don't know the real owning_root, use 0 */
 		btrfs_init_generic_ref(&ref, BTRFS_DROP_DELAYED_REF, new_bytenr,
-				       blocksize, path->nodes[level]->start);
+				       blocksize, path->nodes[level]->start, 0);
 		btrfs_init_tree_ref(&ref, level - 1, src->root_key.objectid,
 				    0, true);
 		ret = btrfs_free_extent(trans, &ref);
@@ -1410,8 +1412,9 @@ int replace_path(struct btrfs_trans_handle *trans, struct reloc_control *rc,
 			break;
 		}
 
+		/* We don't know the real owning_root, use 0 */
 		btrfs_init_generic_ref(&ref, BTRFS_DROP_DELAYED_REF, old_bytenr,
-				       blocksize, 0);
+				       blocksize, 0, 0);
 		btrfs_init_tree_ref(&ref, level - 1, dest->root_key.objectid,
 				    0, true);
 		ret = btrfs_free_extent(trans, &ref);
@@ -2491,7 +2494,7 @@ static int do_relocation(struct btrfs_trans_handle *trans,
 
 			btrfs_init_generic_ref(&ref, BTRFS_ADD_DELAYED_REF,
 					       node->eb->start, blocksize,
-					       upper->eb->start);
+					       upper->eb->start, btrfs_header_owner(upper->eb));
 			btrfs_init_tree_ref(&ref, node->level,
 					    btrfs_header_owner(upper->eb),
 					    root->root_key.objectid, false);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 8ad7e7e38d18..51aaaefaf39d 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -767,7 +767,8 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans,
 			} else if (ret == 0) {
 				btrfs_init_generic_ref(&ref,
 						BTRFS_ADD_DELAYED_REF,
-						ins.objectid, ins.offset, 0);
+						ins.objectid, ins.offset, 0,
+						root->root_key.objectid);
 				btrfs_init_data_ref(&ref,
 						root->root_key.objectid,
 						key->objectid, offset, 0, false);
-- 
2.41.0



* [PATCH v5 10/18] btrfs: track original extent owner in head_ref
  2023-07-27 22:12 [PATCH v5 00/18] btrfs: simple quotas Boris Burkov
                   ` (8 preceding siblings ...)
  2023-07-27 22:12 ` [PATCH v5 09/18] btrfs: track owning root in btrfs_ref Boris Burkov
@ 2023-07-27 22:12 ` Boris Burkov
  2023-08-21 18:06   ` Josef Bacik
  2023-09-07 11:54   ` David Sterba
  2023-07-27 22:12 ` [PATCH v5 11/18] btrfs: new inline ref storing owning subvol of data extents Boris Burkov
                   ` (8 subsequent siblings)
  18 siblings, 2 replies; 53+ messages in thread
From: Boris Burkov @ 2023-07-27 22:12 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Simple quotas requires tracking the original creating root of any given
extent. This gets complicated when multiple subvolumes create
overlapping/contradictory refs in the same transaction, for example
by modifying or deleting an extent while also snapshotting it.

To resolve this in a general way, take advantage of the fact that we are
essentially already tracking this for handling releasing reservations.
The head ref coalesces the various refs and uses must_insert_reserved to
check if it needs to create an extent/free reservation. Store the ref
that set must_insert_reserved as the owning ref on the head ref.

Note that this can result in writing an extent for the very first time
with an owner different from its only ref, but it will look the same as
if you first created it with the original owning ref, then added the
other ref, then removed the owning ref.
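
The merge rule can be sketched in userspace as below. This is a simplified
model of what the patch does in update_existing_head_ref(), with locking and
all other head fields left out; the sketch_ struct and function names are
assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Only the two head_ref fields relevant to owner tracking. */
struct sketch_head {
	bool must_insert_reserved;
	uint64_t owning_root;	/* 0 means "not known yet" */
};

/* Fold a newly queued head into the one already tracked for the extent. */
static void sketch_update_head(struct sketch_head *existing,
			       const struct sketch_head *update)
{
	/* A free may be queued before the owner is known; adopt it once seen. */
	if (!existing->owning_root)
		existing->owning_root = update->owning_root;

	/* The ref that allocates the extent always decides the owner. */
	if (update->must_insert_reserved) {
		existing->must_insert_reserved = true;
		existing->owning_root = update->owning_root;
	}
}
```

In this model a later allocating ref overrides whatever owner an earlier
update recorded, which matches the commit message: the ref that sets
must_insert_reserved is the owning ref.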

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/delayed-ref.c | 20 ++++++++++++++++----
 fs/btrfs/delayed-ref.h |  7 +++++++
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index f0bae1e1c455..28ba7a9eb3c3 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -623,6 +623,16 @@ static noinline void update_existing_head_ref(struct btrfs_trans_handle *trans,
 	BUG_ON(existing->is_data != update->is_data);
 
 	spin_lock(&existing->lock);
+
+	/*
+	 * When freeing an extent, we may not know the owning root
+	 * when we first create the head_ref. However, some deref before the
+	 * last deref will know it, so we just need to update the head_ref
+	 * accordingly
+	 */
+	if (!existing->owning_root)
+		existing->owning_root = update->owning_root;
+
 	if (update->must_insert_reserved) {
 		/* if the extent was freed and then
 		 * reallocated before the delayed ref
@@ -632,6 +642,7 @@ static noinline void update_existing_head_ref(struct btrfs_trans_handle *trans,
 		 * Set it again here
 		 */
 		existing->must_insert_reserved = update->must_insert_reserved;
+		existing->owning_root = update->owning_root;
 
 		/*
 		 * update the num_bytes so we make sure the accounting
@@ -694,7 +705,7 @@ static void init_delayed_ref_head(struct btrfs_delayed_ref_head *head_ref,
 				  struct btrfs_qgroup_extent_record *qrecord,
 				  u64 bytenr, u64 num_bytes, u64 ref_root,
 				  u64 reserved, int action, bool is_data,
-				  bool is_system)
+				  bool is_system, u64 owning_root)
 {
 	int count_mod = 1;
 	bool must_insert_reserved = false;
@@ -735,6 +746,7 @@ static void init_delayed_ref_head(struct btrfs_delayed_ref_head *head_ref,
 	head_ref->num_bytes = num_bytes;
 	head_ref->ref_mod = count_mod;
 	head_ref->must_insert_reserved = must_insert_reserved;
+	head_ref->owning_root = owning_root;
 	head_ref->is_data = is_data;
 	head_ref->is_system = is_system;
 	head_ref->ref_tree = RB_ROOT_CACHED;
@@ -922,7 +934,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
 
 	init_delayed_ref_head(head_ref, record, bytenr, num_bytes,
 			      generic_ref->tree_ref.ref_root, 0, action,
-			      false, is_system);
+			      false, is_system, generic_ref->owning_root);
 	head_ref->extent_op = extent_op;
 
 	delayed_refs = &trans->transaction->delayed_refs;
@@ -1014,7 +1026,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
 	}
 
 	init_delayed_ref_head(head_ref, record, bytenr, num_bytes, ref_root,
-			      reserved, action, true, false);
+			      reserved, action, true, false, generic_ref->owning_root);
 	head_ref->extent_op = NULL;
 
 	delayed_refs = &trans->transaction->delayed_refs;
@@ -1060,7 +1072,7 @@ int btrfs_add_delayed_extent_op(struct btrfs_trans_handle *trans,
 		return -ENOMEM;
 
 	init_delayed_ref_head(head_ref, NULL, bytenr, num_bytes, 0, 0,
-			      BTRFS_UPDATE_DELAYED_HEAD, false, false);
+			      BTRFS_UPDATE_DELAYED_HEAD, false, false, 0);
 	head_ref->extent_op = extent_op;
 
 	delayed_refs = &trans->transaction->delayed_refs;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 0729850a9193..71f0a6e5d583 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -117,6 +117,13 @@ struct btrfs_delayed_ref_head {
 	 * the free has happened.
 	 */
 	bool must_insert_reserved;
+
+	/*
+	 * The root which triggered the allocation when
+	 * must_insert_reserved is true
+	 */
+	u64 owning_root;
+
 	bool is_data;
 	bool is_system;
 	bool processing;
-- 
2.41.0



* [PATCH v5 11/18] btrfs: new inline ref storing owning subvol of data extents
  2023-07-27 22:12 [PATCH v5 00/18] btrfs: simple quotas Boris Burkov
                   ` (9 preceding siblings ...)
  2023-07-27 22:12 ` [PATCH v5 10/18] btrfs: track original extent owner in head_ref Boris Burkov
@ 2023-07-27 22:12 ` Boris Burkov
  2023-08-21 18:07   ` Josef Bacik
  2023-09-07 12:06   ` David Sterba
  2023-07-27 22:12 ` [PATCH v5 12/18] btrfs: inline owner ref lookup helper Boris Burkov
                   ` (7 subsequent siblings)
  18 siblings, 2 replies; 53+ messages in thread
From: Boris Burkov @ 2023-07-27 22:12 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

In order to implement simple quota groups, we need to be able to
associate a data extent with the subvolume that created it. Once you
account for reflink, this information cannot be recovered without
explicitly storing it. Options for storing it are:
- a new key/item
- a new extent inline ref item

The former is backwards compatible but wastes space; the latter is
incompat but is space efficient and reuses the existing inline ref
machinery, while only abusing it a tiny amount -- specifically, the new
item is not a ref, per se.
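
To make the space cost concrete, here is a small userspace sketch of the
item size for a non-shared data extent with and without the owner ref. The
sketch_ structs are stand-ins that assume the packed on-disk layouts (a
three-field extent item, a one-byte type followed by a 64-bit offset for the
inline ref, and the usual extent data ref); they are not the uapi
definitions, and the packed attribute assumes GCC or Clang.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_extent_item {
	uint64_t refs;
	uint64_t generation;
	uint64_t flags;
} __attribute__((__packed__));

struct sketch_inline_ref {
	uint8_t type;
	uint64_t offset;	/* the owner ref's root_id overlays this field */
} __attribute__((__packed__));

struct sketch_data_ref {
	uint64_t root;
	uint64_t objectid;
	uint64_t offset;
	uint32_t count;
} __attribute__((__packed__));

/* Item size for a data extent with one EXTENT_DATA_REF inline ref. */
static size_t sketch_data_extent_item_size(bool simple_quota)
{
	size_t size = sizeof(struct sketch_extent_item);

	/* The owner ref is a bare inline ref: type byte plus root id. */
	if (simple_quota)
		size += sizeof(struct sketch_inline_ref);

	/* A data ref replaces the offset field, so only the type byte is shared. */
	size += sizeof(struct sketch_data_ref) +
		offsetof(struct sketch_inline_ref, offset);
	return size;
}
```

Under these assumptions the owner ref adds sizeof(struct sketch_inline_ref)
bytes per data extent item, with no extra key or item header.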

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/accessors.h            |  4 +++
 fs/btrfs/backref.c              |  3 ++
 fs/btrfs/extent-tree.c          | 56 ++++++++++++++++++++++++++-------
 fs/btrfs/print-tree.c           | 12 +++++++
 fs/btrfs/ref-verify.c           |  3 ++
 fs/btrfs/tree-checker.c         |  3 ++
 include/uapi/linux/btrfs_tree.h |  6 ++++
 7 files changed, 76 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
index 8cfc8214109c..a23045c05937 100644
--- a/fs/btrfs/accessors.h
+++ b/fs/btrfs/accessors.h
@@ -349,6 +349,8 @@ BTRFS_SETGET_FUNCS(extent_data_ref_count, struct btrfs_extent_data_ref, count, 3
 
 BTRFS_SETGET_FUNCS(shared_data_ref_count, struct btrfs_shared_data_ref, count, 32);
 
+BTRFS_SETGET_FUNCS(extent_owner_ref_root_id, struct btrfs_extent_owner_ref, root_id, 64);
+
 BTRFS_SETGET_FUNCS(extent_inline_ref_type, struct btrfs_extent_inline_ref,
 		   type, 8);
 BTRFS_SETGET_FUNCS(extent_inline_ref_offset, struct btrfs_extent_inline_ref,
@@ -365,6 +367,8 @@ static inline u32 btrfs_extent_inline_ref_size(int type)
 	if (type == BTRFS_EXTENT_DATA_REF_KEY)
 		return sizeof(struct btrfs_extent_data_ref) +
 		       offsetof(struct btrfs_extent_inline_ref, offset);
+	if (type == BTRFS_EXTENT_OWNER_REF_KEY)
+		return sizeof(struct btrfs_extent_inline_ref);
 	return 0;
 }
 
diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 79336fa853db..d5bb6a880713 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1129,6 +1129,9 @@ static int add_inline_refs(struct btrfs_backref_walk_ctx *ctx,
 						       count, sc, GFP_NOFS);
 			break;
 		}
+		case BTRFS_EXTENT_OWNER_REF_KEY:
+			WARN_ON(!btrfs_fs_incompat(ctx->fs_info, SIMPLE_QUOTA));
+			break;
 		default:
 			WARN_ON(1);
 		}
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4f0115553cd3..c6d537bf5ad4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -342,9 +342,13 @@ int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
 				     struct btrfs_extent_inline_ref *iref,
 				     enum btrfs_inline_ref_type is_data)
 {
+	struct btrfs_fs_info *fs_info = eb->fs_info;
 	int type = btrfs_extent_inline_ref_type(eb, iref);
 	u64 offset = btrfs_extent_inline_ref_offset(eb, iref);
 
+	if (type == BTRFS_EXTENT_OWNER_REF_KEY && btrfs_fs_incompat(fs_info, SIMPLE_QUOTA))
+		return type;
+
 	if (type == BTRFS_TREE_BLOCK_REF_KEY ||
 	    type == BTRFS_SHARED_BLOCK_REF_KEY ||
 	    type == BTRFS_SHARED_DATA_REF_KEY ||
@@ -353,26 +357,25 @@ int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
 			if (type == BTRFS_TREE_BLOCK_REF_KEY)
 				return type;
 			if (type == BTRFS_SHARED_BLOCK_REF_KEY) {
-				ASSERT(eb->fs_info);
+				ASSERT(fs_info);
 				/*
 				 * Every shared one has parent tree block,
 				 * which must be aligned to sector size.
 				 */
-				if (offset &&
-				    IS_ALIGNED(offset, eb->fs_info->sectorsize))
+				if (offset && IS_ALIGNED(offset, fs_info->sectorsize))
 					return type;
 			}
 		} else if (is_data == BTRFS_REF_TYPE_DATA) {
 			if (type == BTRFS_EXTENT_DATA_REF_KEY)
 				return type;
 			if (type == BTRFS_SHARED_DATA_REF_KEY) {
-				ASSERT(eb->fs_info);
+				ASSERT(fs_info);
 				/*
 				 * Every shared one has parent tree block,
 				 * which must be aligned to sector size.
 				 */
 				if (offset &&
-				    IS_ALIGNED(offset, eb->fs_info->sectorsize))
+				    IS_ALIGNED(offset, fs_info->sectorsize))
 					return type;
 			}
 		} else {
@@ -382,7 +385,7 @@ int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
 	}
 
 	btrfs_print_leaf(eb);
-	btrfs_err(eb->fs_info,
+	btrfs_err(fs_info,
 		  "eb %llu iref 0x%lx invalid extent inline ref type %d",
 		  eb->start, (unsigned long)iref, type);
 	WARN_ON(1);
@@ -891,6 +894,11 @@ int lookup_inline_extent_backref(struct btrfs_trans_handle *trans,
 		}
 		iref = (struct btrfs_extent_inline_ref *)ptr;
 		type = btrfs_get_extent_inline_ref_type(leaf, iref, needed);
+		if (type == BTRFS_EXTENT_OWNER_REF_KEY) {
+			WARN_ON(!btrfs_fs_incompat(fs_info, SIMPLE_QUOTA));
+			ptr += btrfs_extent_inline_ref_size(type);
+			continue;
+		}
 		if (type == BTRFS_REF_TYPE_INVALID) {
 			err = -EUCLEAN;
 			goto out;
@@ -1684,6 +1692,8 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 		 node->type == BTRFS_SHARED_DATA_REF_KEY)
 		ret = run_delayed_data_ref(trans, node, extent_op,
 					   insert_reserved);
+	else if (node->type == BTRFS_EXTENT_OWNER_REF_KEY)
+		ret = 0;
 	else
 		BUG();
 	if (ret && insert_reserved)
@@ -2250,6 +2260,7 @@ static noinline int check_committed_ref(struct btrfs_root *root,
 	struct btrfs_extent_item *ei;
 	struct btrfs_key key;
 	u32 item_size;
+	u32 expected_size;
 	int type;
 	int ret;
 
@@ -2276,10 +2287,22 @@ static noinline int check_committed_ref(struct btrfs_root *root,
 	ret = 1;
 	item_size = btrfs_item_size(leaf, path->slots[0]);
 	ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item);
+	expected_size = sizeof(*ei) + btrfs_extent_inline_ref_size(BTRFS_EXTENT_DATA_REF_KEY);
+
+	/* No inline refs; we need to bail before checking for owner ref */
+	if (item_size == sizeof(*ei))
+		goto out;
+
+	/* Check for an owner ref; skip over it to the real inline refs */
+	iref = (struct btrfs_extent_inline_ref *)(ei + 1);
+	type = btrfs_get_extent_inline_ref_type(leaf, iref, BTRFS_REF_TYPE_DATA);
+	if (btrfs_fs_incompat(fs_info, SIMPLE_QUOTA) && type == BTRFS_EXTENT_OWNER_REF_KEY) {
+		expected_size += btrfs_extent_inline_ref_size(BTRFS_EXTENT_OWNER_REF_KEY);
+		iref = (struct btrfs_extent_inline_ref *)(iref + 1);
+	}
 
 	/* If extent item has more than 1 inline ref then it's shared */
-	if (item_size != sizeof(*ei) +
-	    btrfs_extent_inline_ref_size(BTRFS_EXTENT_DATA_REF_KEY))
+	if (item_size != expected_size)
 		goto out;
 
 	/*
@@ -2291,8 +2314,6 @@ static noinline int check_committed_ref(struct btrfs_root *root,
 	     btrfs_root_last_snapshot(&root->root_item)))
 		goto out;
 
-	iref = (struct btrfs_extent_inline_ref *)(ei + 1);
-
 	/* If this extent has SHARED_DATA_REF then it's shared */
 	type = btrfs_get_extent_inline_ref_type(leaf, iref, BTRFS_REF_TYPE_DATA);
 	if (type != BTRFS_EXTENT_DATA_REF_KEY)
@@ -4543,18 +4564,23 @@ static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
 	struct btrfs_root *extent_root;
 	int ret;
 	struct btrfs_extent_item *extent_item;
+	struct btrfs_extent_owner_ref *oref;
 	struct btrfs_extent_inline_ref *iref;
 	struct btrfs_path *path;
 	struct extent_buffer *leaf;
 	int type;
 	u32 size;
+	bool simple_quota = btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE;
 
 	if (parent > 0)
 		type = BTRFS_SHARED_DATA_REF_KEY;
 	else
 		type = BTRFS_EXTENT_DATA_REF_KEY;
 
-	size = sizeof(*extent_item) + btrfs_extent_inline_ref_size(type);
+	size = sizeof(*extent_item);
+	if (simple_quota)
+		size += btrfs_extent_inline_ref_size(BTRFS_EXTENT_OWNER_REF_KEY);
+	size += btrfs_extent_inline_ref_size(type);
 
 	path = btrfs_alloc_path();
 	if (!path)
@@ -4575,8 +4601,16 @@ static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
 	btrfs_set_extent_flags(leaf, extent_item,
 			       flags | BTRFS_EXTENT_FLAG_DATA);
 
+
 	iref = (struct btrfs_extent_inline_ref *)(extent_item + 1);
+	if (simple_quota) {
+		btrfs_set_extent_inline_ref_type(leaf, iref, BTRFS_EXTENT_OWNER_REF_KEY);
+		oref = (struct btrfs_extent_owner_ref *)(&iref->offset);
+		btrfs_set_extent_owner_ref_root_id(leaf, oref, root_objectid);
+		iref = (struct btrfs_extent_inline_ref *)(oref + 1);
+	}
 	btrfs_set_extent_inline_ref_type(leaf, iref, type);
+
 	if (parent > 0) {
 		struct btrfs_shared_data_ref *ref;
 		ref = (struct btrfs_shared_data_ref *)(iref + 1);
diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c
index aa06d9ca911d..3fac15ce0db0 100644
--- a/fs/btrfs/print-tree.c
+++ b/fs/btrfs/print-tree.c
@@ -80,12 +80,20 @@ static void print_extent_data_ref(const struct extent_buffer *eb,
 	       btrfs_extent_data_ref_count(eb, ref));
 }
 
+static void print_extent_owner_ref(const struct extent_buffer *eb,
+				   struct btrfs_extent_owner_ref *ref)
+{
+	WARN_ON(!btrfs_fs_incompat(eb->fs_info, SIMPLE_QUOTA));
+	pr_cont("extent data owner root %llu\n", btrfs_extent_owner_ref_root_id(eb, ref));
+}
+
 static void print_extent_item(const struct extent_buffer *eb, int slot, int type)
 {
 	struct btrfs_extent_item *ei;
 	struct btrfs_extent_inline_ref *iref;
 	struct btrfs_extent_data_ref *dref;
 	struct btrfs_shared_data_ref *sref;
+	struct btrfs_extent_owner_ref *oref;
 	struct btrfs_disk_key key;
 	unsigned long end;
 	unsigned long ptr;
@@ -159,6 +167,10 @@ static void print_extent_item(const struct extent_buffer *eb, int slot, int type
 			"\t\t\t(parent %llu not aligned to sectorsize %u)\n",
 				     offset, eb->fs_info->sectorsize);
 			break;
+		case BTRFS_EXTENT_OWNER_REF_KEY:
+			oref = (struct btrfs_extent_owner_ref *)(&iref->offset);
+			print_extent_owner_ref(eb, oref);
+			break;
 		default:
 			pr_cont("(extent %llu has INVALID ref type %d)\n",
 				  eb->start, type);
diff --git a/fs/btrfs/ref-verify.c b/fs/btrfs/ref-verify.c
index b7b3bd86f5e2..c0660233feb4 100644
--- a/fs/btrfs/ref-verify.c
+++ b/fs/btrfs/ref-verify.c
@@ -485,6 +485,9 @@ static int process_extent_item(struct btrfs_fs_info *fs_info,
 			ret = add_shared_data_ref(fs_info, offset, count,
 						  key->objectid, key->offset);
 			break;
+		case BTRFS_EXTENT_OWNER_REF_KEY:
+			WARN_ON(!btrfs_fs_incompat(fs_info, SIMPLE_QUOTA));
+			break;
 		default:
 			btrfs_err(fs_info, "invalid key type in iref");
 			ret = -EINVAL;
diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index 038dfa8f1788..72d29ab74a01 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -1451,6 +1451,9 @@ static int check_extent_item(struct extent_buffer *leaf,
 			}
 			inline_refs += btrfs_shared_data_ref_count(leaf, sref);
 			break;
+		case BTRFS_EXTENT_OWNER_REF_KEY:
+			WARN_ON(!btrfs_fs_incompat(fs_info, SIMPLE_QUOTA));
+			break;
 		default:
 			extent_err(leaf, slot, "unknown inline ref type: %u",
 				   inline_type);
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 47aca414a41b..eacb26caf3c6 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -226,6 +226,8 @@
 
 #define BTRFS_SHARED_DATA_REF_KEY	184
 
+#define BTRFS_EXTENT_OWNER_REF_KEY	190
+
 /*
  * block groups give us hints into the extent allocation trees.  Which
  * blocks are free etc etc
@@ -783,6 +785,10 @@ struct btrfs_shared_data_ref {
 	__le32 count;
 } __attribute__ ((__packed__));
 
+struct btrfs_extent_owner_ref {
+	__le64 root_id;
+} __attribute__ ((__packed__));
+
 struct btrfs_extent_inline_ref {
 	__u8 type;
 	__le64 offset;
-- 
2.41.0



* [PATCH v5 12/18] btrfs: inline owner ref lookup helper
  2023-07-27 22:12 [PATCH v5 00/18] btrfs: simple quotas Boris Burkov
                   ` (10 preceding siblings ...)
  2023-07-27 22:12 ` [PATCH v5 11/18] btrfs: new inline ref storing owning subvol of data extents Boris Burkov
@ 2023-07-27 22:12 ` Boris Burkov
  2023-09-07 12:10   ` David Sterba
  2023-07-27 22:13 ` [PATCH v5 13/18] btrfs: record simple quota deltas Boris Burkov
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 53+ messages in thread
From: Boris Burkov @ 2023-07-27 22:12 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Inline ref parsing is a bit tricky and relies on a decent amount of
implicit information, so I think it is beneficial to have a helper
function for reading the owner ref, if only to "document" the format,
along with the write path.

The main subtlety, which I missed when open-coding this, is that it is
important to check whether or not inline refs are present *at all*.
If we are writing out a new extent under squotas, we will always use
an item big enough for the owner ref and write it. However, it is
possible that some extent item predating squotas will not have any
inline refs. In that case, trying to read the "type" field of the
first inline ref will just be reading garbage in the form of whatever
is in the next item.

This will be used by the extent freeing path, which looks up data
extent owners, as well as a relocation path which needs to grab the owner
before relocating an extent.
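
The bounds check described above can be sketched against a raw byte buffer
as follows. This is a userspace approximation, not the helper from the
patch: it assumes a 24-byte extent item header, the owner ref type value 190
from this series, and a little-endian root id stored right after the type
byte. The SKETCH_ constants and the function name are made up for the
example.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EXTENT_ITEM_SIZE	24	/* assumed size of the fixed header */
#define SKETCH_OWNER_REF_TYPE	190	/* BTRFS_EXTENT_OWNER_REF_KEY */

/* Return the owner root stored in the first inline ref, or 0 if absent. */
static uint64_t sketch_get_owner_root(const uint8_t *item, size_t item_size)
{
	const uint8_t *ptr = item + SKETCH_EXTENT_ITEM_SIZE;
	uint64_t root_id = 0;
	int i;

	/* No inline refs of any kind: reading a type would read past the item. */
	if (item_size <= SKETCH_EXTENT_ITEM_SIZE)
		return 0;

	/* Too short to hold a type byte plus a 64-bit root id. */
	if (item_size < SKETCH_EXTENT_ITEM_SIZE + 1 + sizeof(root_id))
		return 0;

	/* Inline refs exist, but the first one is not an owner ref. */
	if (ptr[0] != SKETCH_OWNER_REF_TYPE)
		return 0;

	/* Decode the little-endian root id that follows the type byte. */
	for (i = 7; i >= 0; i--)
		root_id = (root_id << 8) | ptr[1 + i];
	return root_id;
}
```

Only the first inline ref is inspected, which relies on the owner ref
always being written first, as the write path in the previous patch does.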

Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 51 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/extent-tree.h |  3 +++
 2 files changed, 54 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c6d537bf5ad4..09fb321fa560 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2805,6 +2805,57 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
 	return 0;
 }
 
+/*
+ * Helper to parse an extent item's inline extents looking for a simple
+ * quotas owner ref.
+ *
+ * @fs_info  - the btrfs_fs_info for this mount
+ * @leaf     - a leaf in the extent tree containing the extent item
+ * @slot     - the slot in the leaf where the extent item is found
+ *
+ * Returns the objectid of the root that originally allocated the extent item
+ * if the inline owner ref is expected and present, otherwise 0.
+ *
+ * If an extent item has an owner ref item, it will be the first
+ * inline ref item. Therefore the logic is to check whether there are
+ * any inline ref items, then check the type of the first one.
+ *
+ */
+u64 btrfs_get_extent_owner_root(struct btrfs_fs_info *fs_info,
+				struct extent_buffer *leaf,
+				int slot)
+{
+	struct btrfs_extent_item *ei;
+	struct btrfs_extent_inline_ref *iref;
+	struct btrfs_extent_owner_ref *oref;
+	unsigned long ptr;
+	unsigned long end;
+	int type;
+
+	if (!btrfs_fs_incompat(fs_info, SIMPLE_QUOTA))
+		return 0;
+
+	ei = btrfs_item_ptr(leaf, slot, struct btrfs_extent_item);
+	ptr = (unsigned long)(ei + 1);
+	end = (unsigned long)ei + btrfs_item_size(leaf, slot);
+
+	/* No inline ref items of any kind, can't check type */
+	if (ptr == end)
+		return 0;
+
+	iref = (struct btrfs_extent_inline_ref *)ptr;
+	type = btrfs_get_extent_inline_ref_type(leaf, iref, BTRFS_REF_TYPE_ANY);
+
+	/* We found an owner ref, get the root out of it */
+	if (type == BTRFS_EXTENT_OWNER_REF_KEY) {
+		oref = (struct btrfs_extent_owner_ref *)(&iref->offset);
+		return btrfs_extent_owner_ref_root_id(leaf, oref);
+	}
+
+	/* We have inline refs, but not an owner ref */
+	return 0;
+}
+
 static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
 				     u64 bytenr, u64 num_bytes, bool is_data)
 {
diff --git a/fs/btrfs/extent-tree.h b/fs/btrfs/extent-tree.h
index b9e148adcd28..7c27652880a2 100644
--- a/fs/btrfs/extent-tree.h
+++ b/fs/btrfs/extent-tree.h
@@ -141,6 +141,9 @@ int btrfs_set_disk_extent_flags(struct btrfs_trans_handle *trans,
 				struct extent_buffer *eb, u64 flags);
 int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_ref *ref);
 
+u64 btrfs_get_extent_owner_root(struct btrfs_fs_info *fs_info,
+				struct extent_buffer *leaf,
+				int slot);
 int btrfs_free_reserved_extent(struct btrfs_fs_info *fs_info,
 			       u64 start, u64 len, int delalloc);
 int btrfs_pin_reserved_extent(struct btrfs_trans_handle *trans, u64 start, u64 len);
-- 
2.41.0



* [PATCH v5 13/18] btrfs: record simple quota deltas
  2023-07-27 22:12 [PATCH v5 00/18] btrfs: simple quotas Boris Burkov
                   ` (11 preceding siblings ...)
  2023-07-27 22:12 ` [PATCH v5 12/18] btrfs: inline owner ref lookup helper Boris Burkov
@ 2023-07-27 22:13 ` Boris Burkov
  2023-08-21 18:08   ` Josef Bacik
  2023-09-07 12:12   ` David Sterba
  2023-07-27 22:13 ` [PATCH v5 14/18] btrfs: simple quota auto hierarchy for nested subvols Boris Burkov
                   ` (5 subsequent siblings)
  18 siblings, 2 replies; 53+ messages in thread
From: Boris Burkov @ 2023-07-27 22:13 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

At the moment that we run delayed refs, we make the final ref-count
based decision on creating/removing extent (and metadata) items.
Therefore, it is exactly the spot to hook up simple quotas.

There are a few important subtleties to the fields we must collect to
accurately track simple quotas, particularly when removing an extent.
When removing a data extent, the ref could be in any tree (due to
reflink, for example) and so we need to recover the owning root id from
the owner ref item. When removing a metadata extent, we know the owning
root from the owner field in the header when we create the delayed ref,
so we can recover it from there.

We must also be careful to handle reservations properly to avoid leaking
reserved space. The happy path is freeing the reservation when the
simple quota delta runs on a data extent. If that doesn't happen, due to
refs canceling out or some error, the ref head already has the
must_insert_reserved machinery to handle this, so we piggy back on that
and use it to clean up the reserved data.
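
The two reservation paths described above can be sketched as a small
userspace model. This is only an illustration under stated assumptions:
the sq_* types and functions are hypothetical stand-ins, not the kernel
structures, and the model assumes the happy path releases exactly
rsv_bytes when the delta is recorded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the owning subvolume's qgroup counters. */
struct sq_qgroup {
	uint64_t usage;    /* bytes accounted to the owning subvolume */
	uint64_t reserved; /* bytes reserved but not yet accounted */
};

/* Loosely mirrors the fields of struct btrfs_simple_quota_delta. */
struct sq_delta {
	uint64_t num_bytes;
	uint64_t rsv_bytes;
	bool is_inc;
	bool is_data;
};

/* Toy model of a delayed ref head that carries a data reservation. */
struct sq_ref_head {
	uint64_t reserved_bytes;
	bool must_insert_reserved;
	bool is_data;
};

/* Happy path: apply the delta and release the reservation it covers. */
static void sq_record_delta(struct sq_qgroup *qg, const struct sq_delta *d)
{
	assert(qg->reserved >= d->rsv_bytes);
	if (d->is_inc) {
		qg->usage += d->num_bytes;
	} else {
		assert(qg->usage >= d->num_bytes);
		qg->usage -= d->num_bytes;
	}
	qg->reserved -= d->rsv_bytes;
}

/*
 * Fallback path: the delta never ran (refs canceled out, or an error),
 * so the head still has must_insert_reserved set and the reservation
 * is dropped during ref head cleanup instead.
 */
static void sq_cleanup_head(struct sq_qgroup *qg,
			    const struct sq_ref_head *head)
{
	if (head->must_insert_reserved && head->is_data) {
		assert(qg->reserved >= head->reserved_bytes);
		qg->reserved -= head->reserved_bytes;
	}
}
```

In this model each reservation is released by exactly one of the two
functions, which is the property the must_insert_reserved hook is meant
to preserve.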

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/delayed-ref.c |  1 +
 fs/btrfs/delayed-ref.h |  6 +++
 fs/btrfs/extent-tree.c | 85 +++++++++++++++++++++++++++++++++++++-----
 3 files changed, 82 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 28ba7a9eb3c3..874c1853d9b1 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -745,6 +745,7 @@ static void init_delayed_ref_head(struct btrfs_delayed_ref_head *head_ref,
 	head_ref->bytenr = bytenr;
 	head_ref->num_bytes = num_bytes;
 	head_ref->ref_mod = count_mod;
+	head_ref->reserved_bytes = reserved;
 	head_ref->must_insert_reserved = must_insert_reserved;
 	head_ref->owning_root = owning_root;
 	head_ref->is_data = is_data;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 71f0a6e5d583..221d400dd88f 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -104,6 +104,12 @@ struct btrfs_delayed_ref_head {
 	 */
 	int ref_mod;
 
+	/*
+	 * Track reserved bytes when setting must_insert_reserved.
+	 * On success or cleanup, we will need to free the reservation.
+	 */
+	u64 reserved_bytes;
+
 	/*
 	 * when a new extent is allocated, it is just reserved in memory
 	 * The actual extent isn't inserted into the extent allocation tree
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 09fb321fa560..1b5efd03ef83 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -47,6 +47,7 @@
 
 
 static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
+			       struct btrfs_delayed_ref_head *href,
 			       struct btrfs_delayed_ref_node *node, u64 parent,
 			       u64 root_objectid, u64 owner_objectid,
 			       u64 owner_offset, int refs_to_drop,
@@ -1482,6 +1483,7 @@ static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 }
 
 static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
+				struct btrfs_delayed_ref_head *href,
 				struct btrfs_delayed_ref_node *node,
 				struct btrfs_delayed_extent_op *extent_op,
 				bool insert_reserved)
@@ -1505,18 +1507,28 @@ static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
 	ref_root = ref->root;
 
 	if (node->action == BTRFS_ADD_DELAYED_REF && insert_reserved) {
+		struct btrfs_simple_quota_delta delta = {
+			.root = href->owning_root,
+			.num_bytes = node->num_bytes,
+			.rsv_bytes = href->reserved_bytes,
+			.is_data = true,
+			.is_inc	= true,
+		};
+
 		if (extent_op)
 			flags |= extent_op->flags_to_set;
 		ret = alloc_reserved_file_extent(trans, parent, ref_root,
 						 flags, ref->objectid,
 						 ref->offset, &ins,
 						 node->ref_mod);
+		if (!ret)
+			ret = btrfs_record_simple_quota_delta(trans->fs_info, &delta);
 	} else if (node->action == BTRFS_ADD_DELAYED_REF) {
 		ret = __btrfs_inc_extent_ref(trans, node, parent, ref_root,
 					     ref->objectid, ref->offset,
 					     node->ref_mod, extent_op);
 	} else if (node->action == BTRFS_DROP_DELAYED_REF) {
-		ret = __btrfs_free_extent(trans, node, parent,
+		ret = __btrfs_free_extent(trans, href, node, parent,
 					  ref_root, ref->objectid,
 					  ref->offset, node->ref_mod,
 					  extent_op);
@@ -1632,11 +1644,13 @@ static int run_delayed_extent_op(struct btrfs_trans_handle *trans,
 }
 
 static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
+				struct btrfs_delayed_ref_head *href,
 				struct btrfs_delayed_ref_node *node,
 				struct btrfs_delayed_extent_op *extent_op,
 				bool insert_reserved)
 {
 	int ret = 0;
+	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct btrfs_delayed_tree_ref *ref;
 	u64 parent = 0;
 	u64 ref_root = 0;
@@ -1656,13 +1670,23 @@ static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
 		return -EIO;
 	}
 	if (node->action == BTRFS_ADD_DELAYED_REF && insert_reserved) {
+		struct btrfs_simple_quota_delta delta = {
+			.root = href->owning_root,
+			.num_bytes = fs_info->nodesize,
+			.rsv_bytes = 0,
+			.is_data = false,
+			.is_inc = true,
+		};
+
 		BUG_ON(!extent_op || !extent_op->update_flags);
 		ret = alloc_reserved_tree_block(trans, node, extent_op);
+		if (!ret)
+			btrfs_record_simple_quota_delta(fs_info, &delta);
 	} else if (node->action == BTRFS_ADD_DELAYED_REF) {
 		ret = __btrfs_inc_extent_ref(trans, node, parent, ref_root,
 					     ref->level, 0, 1, extent_op);
 	} else if (node->action == BTRFS_DROP_DELAYED_REF) {
-		ret = __btrfs_free_extent(trans, node, parent, ref_root,
+		ret = __btrfs_free_extent(trans, href, node, parent, ref_root,
 					  ref->level, 0, 1, extent_op);
 	} else {
 		BUG();
@@ -1672,6 +1696,7 @@ static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
 
 /* helper function to actually process a single delayed ref entry */
 static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
+			       struct btrfs_delayed_ref_head *href,
 			       struct btrfs_delayed_ref_node *node,
 			       struct btrfs_delayed_extent_op *extent_op,
 			       bool insert_reserved)
@@ -1686,12 +1711,12 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 
 	if (node->type == BTRFS_TREE_BLOCK_REF_KEY ||
 	    node->type == BTRFS_SHARED_BLOCK_REF_KEY)
-		ret = run_delayed_tree_ref(trans, node, extent_op,
+		ret = run_delayed_tree_ref(trans, href, node, extent_op,
 					   insert_reserved);
 	else if (node->type == BTRFS_EXTENT_DATA_REF_KEY ||
 		 node->type == BTRFS_SHARED_DATA_REF_KEY)
-		ret = run_delayed_data_ref(trans, node, extent_op,
-					   insert_reserved);
+		ret = run_delayed_data_ref(trans, href, node,
+					   extent_op, insert_reserved);
 	else if (node->type == BTRFS_EXTENT_OWNER_REF_KEY)
 		ret = 0;
 	else
@@ -1788,6 +1813,11 @@ void btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info,
 		spin_unlock(&delayed_refs->lock);
 		nr_items += btrfs_csum_bytes_to_leaves(fs_info, head->num_bytes);
 	}
+	if (head->must_insert_reserved && head->is_data &&
+	    btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
+		btrfs_qgroup_free_refroot(fs_info, head->owning_root,
+					  head->reserved_bytes,
+					  BTRFS_QGROUP_RSV_DATA);
 
 	btrfs_delayed_refs_rsv_release(fs_info, nr_items);
 }
@@ -1934,8 +1964,8 @@ static int btrfs_run_delayed_refs_for_head(struct btrfs_trans_handle *trans,
 		locked_ref->extent_op = NULL;
 		spin_unlock(&locked_ref->lock);
 
-		ret = run_one_delayed_ref(trans, ref, extent_op,
-					  must_insert_reserved);
+		ret = run_one_delayed_ref(trans, locked_ref, ref,
+					  extent_op, must_insert_reserved);
 
 		btrfs_free_delayed_extent_op(extent_op);
 		if (ret) {
@@ -2857,11 +2887,12 @@ u64 btrfs_get_extent_owner_root(struct btrfs_fs_info *fs_info,
 }
 
 static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
-				     u64 bytenr, u64 num_bytes, bool is_data)
+				     u64 bytenr, struct btrfs_simple_quota_delta *delta)
 {
 	int ret;
+	u64 num_bytes = delta->num_bytes;
 
-	if (is_data) {
+	if (delta->is_data) {
 		struct btrfs_root *csum_root;
 
 		csum_root = btrfs_csum_root(trans->fs_info, bytenr);
@@ -2872,6 +2903,12 @@ static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
 		}
 	}
 
+	ret = btrfs_record_simple_quota_delta(trans->fs_info, delta);
+	if (ret) {
+		btrfs_abort_transaction(trans, ret);
+		return ret;
+	}
+
 	ret = add_to_free_space_tree(trans, bytenr, num_bytes);
 	if (ret) {
 		btrfs_abort_transaction(trans, ret);
@@ -2952,6 +2989,7 @@ static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
  * And that (13631488 EXTENT_DATA_REF <HASH>) gets removed.
  */
 static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
+			       struct btrfs_delayed_ref_head *href,
 			       struct btrfs_delayed_ref_node *node, u64 parent,
 			       u64 root_objectid, u64 owner_objectid,
 			       u64 owner_offset, int refs_to_drop,
@@ -2974,6 +3012,7 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 	u64 bytenr = node->bytenr;
 	u64 num_bytes = node->num_bytes;
 	bool skinny_metadata = btrfs_fs_incompat(info, SKINNY_METADATA);
+	u64 delayed_ref_root = href->owning_root;
 
 	extent_root = btrfs_extent_root(info, bytenr);
 	ASSERT(extent_root);
@@ -3172,6 +3211,14 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 			}
 		}
 	} else {
+		struct btrfs_simple_quota_delta delta = {
+			.root = delayed_ref_root,
+			.num_bytes = num_bytes,
+			.rsv_bytes = 0,
+			.is_data = is_data,
+			.is_inc = false,
+		};
+
 		/* In this branch refs == 1 */
 		if (found_extent) {
 			if (is_data && refs_to_drop !=
@@ -3210,6 +3257,16 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 				num_to_del = 2;
 			}
 		}
+		/*
+		 * We can't infer the data owner from the delayed ref, so we
+		 * need to try to get it from the owning ref item.
+		 *
+		 * If it is not present, then that extent was not written under
+		 * simple quotas mode, so we don't need to account for its
+		 * deletion.
+		 */
+		if (is_data)
+			delta.root = btrfs_get_extent_owner_root(trans->fs_info, leaf, extent_slot);
 
 		ret = btrfs_del_items(trans, extent_root, path, path->slots[0],
 				      num_to_del);
@@ -3219,7 +3276,7 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 		}
 		btrfs_release_path(path);
 
-		ret = do_free_extent_accounting(trans, bytenr, num_bytes, is_data);
+		ret = do_free_extent_accounting(trans, bytenr, &delta);
 	}
 	btrfs_release_path(path);
 
@@ -4790,6 +4847,13 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
 	int ret;
 	struct btrfs_block_group *block_group;
 	struct btrfs_space_info *space_info;
+	struct btrfs_simple_quota_delta delta = {
+		.root = root_objectid,
+		.num_bytes = ins->offset,
+		.rsv_bytes = 0,
+		.is_data = true,
+		.is_inc = true,
+	};
 
 	/*
 	 * Mixed block groups will exclude before processing the log so we only
@@ -4818,6 +4882,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
 					 offset, ins, 1);
 	if (ret)
 		btrfs_pin_extent(trans, ins->objectid, ins->offset, 1);
+	ret = btrfs_record_simple_quota_delta(fs_info, &delta);
 	btrfs_put_block_group(block_group);
 	return ret;
 }
-- 
2.41.0



* [PATCH v5 14/18] btrfs: simple quota auto hierarchy for nested subvols
  2023-07-27 22:12 [PATCH v5 00/18] btrfs: simple quotas Boris Burkov
                   ` (12 preceding siblings ...)
  2023-07-27 22:13 ` [PATCH v5 13/18] btrfs: record simple quota deltas Boris Burkov
@ 2023-07-27 22:13 ` Boris Burkov
  2023-08-21 18:10   ` Josef Bacik
  2023-09-07 12:16   ` David Sterba
  2023-07-27 22:13 ` [PATCH v5 15/18] btrfs: check generation when recording simple quota delta Boris Burkov
                   ` (4 subsequent siblings)
  18 siblings, 2 replies; 53+ messages in thread
From: Boris Burkov @ 2023-07-27 22:13 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Consider the following sequence:
- enable quotas
- create subvol S id 256 at dir outer/
- create a qgroup 1/100
- add 0/256 (S's auto qgroup) to 1/100
- create subvol T id 257 at dir outer/inner/

With full qgroups, there is no relationship between 0/257 and either of
0/256 or 1/100. There is an inherit feature that the creator of inner/
can use to specify it ought to be in 1/100.

Simple quotas are targeted at container isolation, where such automatic
inheritance for not necessarily trusted/controlled nested subvol
creation would be quite helpful. Therefore, add a new default behavior
for simple quotas: when you create a nested subvol, automatically
inherit as parents any parents of the qgroup of the subvol the new inode
is going in.

In our example, 0/257 would also be under 1/100, allowing easy control
of a total quota over an arbitrary hierarchy of subvolumes.
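
As a rough illustration of the inheritance rule, the following
userspace sketch copies every parent of the containing subvolume's
qgroup to the newly created one. The ai_* names and the fixed-size
parent array are hypothetical; only the level/id packing (level in the
top 16 bits of the 64-bit qgroup id) is taken from btrfs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define AI_MAX_PARENTS 8

/* A qgroup id written "level/id" packs the level into the top 16 bits. */
static uint64_t ai_qgroupid(uint64_t level, uint64_t id)
{
	return (level << 48) | id;
}

/* Hypothetical, simplified qgroup: an id plus a list of parent ids. */
struct ai_qgroup {
	uint64_t id;
	uint64_t parents[AI_MAX_PARENTS];
	size_t nr_parents;
};

/*
 * Give the qgroup of a newly created nested subvolume the same parents
 * as the qgroup of the subvolume that contains it. Returns the number
 * of parents inherited.
 */
static size_t ai_auto_inherit(const struct ai_qgroup *containing,
			      struct ai_qgroup *created)
{
	size_t i;

	assert(containing->nr_parents <= AI_MAX_PARENTS);
	for (i = 0; i < containing->nr_parents; i++)
		created->parents[i] = containing->parents[i];
	created->nr_parents = containing->nr_parents;
	return created->nr_parents;
}
```

With S modeled as 0/256 with the single parent 1/100, calling
ai_auto_inherit() for T (0/257) leaves T with 1/100 as its only parent.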

I think this _might_ be a generally useful behavior, so it could be
interesting to put it behind a new inheritance flag that simple quotas
always use while traditional quotas let the user specify, but this is a
minimally intrusive change to start.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/ioctl.c       |  2 +-
 fs/btrfs/qgroup.c      | 44 +++++++++++++++++++++++++++++++++++++++---
 fs/btrfs/qgroup.h      |  6 +++---
 fs/btrfs/transaction.c | 13 +++++++++----
 4 files changed, 54 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 9b61bc62e439..c9b069077fd0 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -652,7 +652,7 @@ static noinline int create_subvol(struct mnt_idmap *idmap,
 	/* Tree log can't currently deal with an inode which is a new root. */
 	btrfs_set_log_full_commit(trans);
 
-	ret = btrfs_qgroup_inherit(trans, 0, objectid, inherit);
+	ret = btrfs_qgroup_inherit(trans, 0, objectid, root->root_key.objectid, inherit);
 	if (ret)
 		goto out;
 
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index dedc532669f4..58e9ed0deedd 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1550,8 +1550,7 @@ static int quick_update_accounting(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
-int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src,
-			      u64 dst)
+int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src, u64 dst)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct btrfs_qgroup *parent;
@@ -2991,6 +2990,40 @@ int btrfs_run_qgroups(struct btrfs_trans_handle *trans)
 	return ret;
 }
 
+static int qgroup_auto_inherit(struct btrfs_fs_info *fs_info,
+			       u64 inode_rootid,
+			       struct btrfs_qgroup_inherit **inherit)
+{
+	int i = 0;
+	u64 num_qgroups = 0;
+	struct btrfs_qgroup *inode_qg;
+	struct btrfs_qgroup_list *qg_list;
+
+	if (*inherit)
+		return -EEXIST;
+
+	inode_qg = find_qgroup_rb(fs_info, inode_rootid);
+	if (!inode_qg)
+		return -ENOENT;
+
+	num_qgroups = list_count_nodes(&inode_qg->groups);
+
+	if (!num_qgroups)
+		return 0;
+
+	*inherit = kzalloc(sizeof(**inherit) + num_qgroups * sizeof(u64), GFP_NOFS);
+	if (!*inherit)
+		return -ENOMEM;
+	(*inherit)->num_qgroups = num_qgroups;
+
+	list_for_each_entry(qg_list, &inode_qg->groups, next_group) {
+		u64 qg_id = qg_list->group->qgroupid;
+		*((u64 *)((*inherit)+1) + i) = qg_id;
+	}
+
+	return 0;
+}
+
 /*
  * Copy the accounting information between qgroups. This is necessary
  * when a snapshot or a subvolume is created. Throwing an error will
@@ -2998,7 +3031,8 @@ int btrfs_run_qgroups(struct btrfs_trans_handle *trans)
  * when a readonly fs is a reasonable outcome.
  */
 int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
-			 u64 objectid, struct btrfs_qgroup_inherit *inherit)
+			 u64 objectid, u64 inode_rootid,
+			 struct btrfs_qgroup_inherit *inherit)
 {
 	int ret = 0;
 	int i;
@@ -3040,6 +3074,9 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
 		goto out;
 	}
 
+	if (!inherit && btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
+		qgroup_auto_inherit(fs_info, inode_rootid, &inherit);
+
 	if (inherit) {
 		i_qgroups = (u64 *)(inherit + 1);
 		nums = inherit->num_qgroups + 2 * inherit->num_ref_copies +
@@ -3066,6 +3103,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
 	if (ret)
 		goto out;
 
+
 	/*
 	 * add qgroup to all inherited groups
 	 */
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 94d85b4fbebd..ce6fa8694ca7 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -271,8 +271,7 @@ int btrfs_qgroup_rescan(struct btrfs_fs_info *fs_info);
 void btrfs_qgroup_rescan_resume(struct btrfs_fs_info *fs_info);
 int btrfs_qgroup_wait_for_completion(struct btrfs_fs_info *fs_info,
 				     bool interruptible);
-int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src,
-			      u64 dst);
+int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src, u64 dst);
 int btrfs_del_qgroup_relation(struct btrfs_trans_handle *trans, u64 src,
 			      u64 dst);
 int btrfs_create_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid);
@@ -366,7 +365,8 @@ int btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans, u64 bytenr,
 int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans);
 int btrfs_run_qgroups(struct btrfs_trans_handle *trans);
 int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
-			 u64 objectid, struct btrfs_qgroup_inherit *inherit);
+			 u64 objectid, u64 inode_rootid,
+			 struct btrfs_qgroup_inherit *inherit);
 void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info,
 			       u64 ref_root, u64 num_bytes,
 			       enum btrfs_qgroup_rsv_type type);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 25217888e897..fb857147df57 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1529,13 +1529,14 @@ static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
 	int ret;
 
 	/*
-	 * Save some performance in the case that full qgroups are not
+	 * Save some performance in the case that qgroups are not
 	 * enabled. If this check races with the ioctl, rescan will
 	 * kick in anyway.
 	 */
 	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
 		return 0;
 
+
 	/*
 	 * Ensure dirty @src will be committed.  Or, after coming
 	 * commit_fs_roots() and switch_commit_roots(), any dirty but not
@@ -1572,7 +1573,7 @@ static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
 
 	/* Now qgroup are all updated, we can inherit it to new qgroups */
 	ret = btrfs_qgroup_inherit(trans, src->root_key.objectid, dst_objectid,
-				   inherit);
+				   parent->root_key.objectid, inherit);
 	if (ret < 0)
 		goto out;
 
@@ -1839,8 +1840,12 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 	 * To co-operate with that hack, we do hack again.
 	 * Or snapshot will be greatly slowed down by a subtree qgroup rescan
 	 */
-	ret = qgroup_account_snapshot(trans, root, parent_root,
-				      pending->inherit, objectid);
+	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_FULL)
+		ret = qgroup_account_snapshot(trans, root, parent_root,
+					      pending->inherit, objectid);
+	else if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
+		ret = btrfs_qgroup_inherit(trans, root->root_key.objectid, objectid,
+					   parent_root->root_key.objectid, pending->inherit);
 	if (ret < 0)
 		goto fail;
 
-- 
2.41.0



* [PATCH v5 15/18] btrfs: check generation when recording simple quota delta
  2023-07-27 22:12 [PATCH v5 00/18] btrfs: simple quotas Boris Burkov
                   ` (13 preceding siblings ...)
  2023-07-27 22:13 ` [PATCH v5 14/18] btrfs: simple quota auto hierarchy for nested subvols Boris Burkov
@ 2023-07-27 22:13 ` Boris Burkov
  2023-08-21 18:11   ` Josef Bacik
  2023-09-07 12:24   ` David Sterba
  2023-07-27 22:13 ` [PATCH v5 16/18] btrfs: track metadata relocation cow with simple quota Boris Burkov
                   ` (3 subsequent siblings)
  18 siblings, 2 replies; 53+ messages in thread
From: Boris Burkov @ 2023-07-27 22:13 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Simple quotas count extents only from the moment the feature is enabled.
Therefore, if we do something like:
1. create subvol S
2. write F in S
3. enable quotas
4. remove F
5. write G in S

then after 3. and 4. we would expect the simple quota usage of S to be 0
(putting aside some metadata extents that might be written) and after
5., it should be the size of G plus metadata. Therefore, we need to be
able to determine whether a particular quota delta we are processing
predates simple quota enablement.

To do this, store the transaction id when quotas were enabled, both in
fs_info for immediate use and in the quota status item to make it
recoverable on mount. When we see a delta, check if the generation of
the extent item is less than that of quota enablement. If so, we should
ignore the delta from this extent.
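
The check can be modeled in a few lines of userspace C. This is a
sketch only: the gen_* names are hypothetical, and the generation
numbers in the usage note are made up for the example sequence above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-subvolume state for the model. */
struct gen_squota {
	uint64_t enable_gen; /* transaction id at which quotas were enabled */
	uint64_t usage;      /* bytes currently accounted */
};

struct gen_delta {
	uint64_t num_bytes;
	uint64_t generation; /* generation the extent was created in */
	bool is_inc;
};

/*
 * Returns true if the delta was counted, false if it was ignored
 * because the extent predates quota enablement.
 */
static bool gen_record_delta(struct gen_squota *sq, const struct gen_delta *d)
{
	if (d->generation < sq->enable_gen)
		return false;

	if (d->is_inc) {
		sq->usage += d->num_bytes;
	} else {
		assert(sq->usage >= d->num_bytes);
		sq->usage -= d->num_bytes;
	}
	return true;
}
```

If quotas are enabled in generation 10, freeing F (written in
generation 7) is ignored and usage stays at 0, while writing G in
generation 11 is counted in full.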

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/accessors.h            |  2 ++
 fs/btrfs/extent-tree.c          |  4 ++++
 fs/btrfs/fs.h                   |  2 ++
 fs/btrfs/qgroup.c               | 14 ++++++++++++--
 fs/btrfs/qgroup.h               |  1 +
 include/uapi/linux/btrfs_tree.h |  7 +++++++
 6 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
index a23045c05937..513f8edbd98e 100644
--- a/fs/btrfs/accessors.h
+++ b/fs/btrfs/accessors.h
@@ -970,6 +970,8 @@ BTRFS_SETGET_FUNCS(qgroup_status_flags, struct btrfs_qgroup_status_item,
 		   flags, 64);
 BTRFS_SETGET_FUNCS(qgroup_status_rescan, struct btrfs_qgroup_status_item,
 		   rescan, 64);
+BTRFS_SETGET_FUNCS(qgroup_status_enable_gen, struct btrfs_qgroup_status_item,
+		   enable_gen, 64);
 
 /* btrfs_qgroup_info_item */
 BTRFS_SETGET_FUNCS(qgroup_info_generation, struct btrfs_qgroup_info_item,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1b5efd03ef83..395ab46e520b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1513,6 +1513,7 @@ static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
 			.rsv_bytes = href->reserved_bytes,
 			.is_data = true,
 			.is_inc	= true,
+			.generation = trans->transid,
 		};
 
 		if (extent_op)
@@ -1676,6 +1677,7 @@ static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
 			.rsv_bytes = 0,
 			.is_data = false,
 			.is_inc = true,
+			.generation = trans->transid,
 		};
 
 		BUG_ON(!extent_op || !extent_op->update_flags);
@@ -3217,6 +3219,7 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 			.rsv_bytes = 0,
 			.is_data = is_data,
 			.is_inc = false,
+			.generation = btrfs_extent_generation(leaf, ei),
 		};
 
 		/* In this branch refs == 1 */
@@ -4850,6 +4853,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
 	struct btrfs_simple_quota_delta delta = {
 		.root = root_objectid,
 		.num_bytes = ins->offset,
+		.generation = trans->transid,
 		.rsv_bytes = 0,
 		.is_data = true,
 		.is_inc = true,
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index f76f450c2abf..da7b623ff15f 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -802,6 +802,8 @@ struct btrfs_fs_info {
 	spinlock_t eb_leak_lock;
 	struct list_head allocated_ebs;
 #endif
+
+	u64 quota_enable_gen;
 };
 
 static inline void btrfs_set_last_root_drop_gen(struct btrfs_fs_info *fs_info,
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 58e9ed0deedd..a8a603242431 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -454,6 +454,8 @@ int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info)
 			}
 			fs_info->qgroup_flags = btrfs_qgroup_status_flags(l, ptr);
 			simple = fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_SIMPLE;
+			if (simple)
+				fs_info->quota_enable_gen = btrfs_qgroup_status_enable_gen(l, ptr);
 			if (btrfs_qgroup_status_generation(l, ptr) !=
 			    fs_info->generation && !simple) {
 				qgroup_mark_inconsistent(fs_info);
@@ -1107,10 +1109,12 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info,
 	btrfs_set_qgroup_status_generation(leaf, ptr, trans->transid);
 	btrfs_set_qgroup_status_version(leaf, ptr, BTRFS_QGROUP_STATUS_VERSION);
 	fs_info->qgroup_flags = BTRFS_QGROUP_STATUS_FLAG_ON;
-	if (simple)
+	if (simple) {
 		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_SIMPLE;
-	else
+		btrfs_set_qgroup_status_enable_gen(leaf, ptr, trans->transid);
+	} else {
 		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
+	}
 	btrfs_set_qgroup_status_flags(leaf, ptr, fs_info->qgroup_flags &
 				      BTRFS_QGROUP_STATUS_FLAGS_MASK);
 	btrfs_set_qgroup_status_rescan(leaf, ptr, 0);
@@ -1202,6 +1206,8 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info,
 		goto out_free_path;
 	}
 
+	fs_info->quota_enable_gen = trans->transid;
+
 	mutex_unlock(&fs_info->qgroup_ioctl_lock);
 	/*
 	 * Commit the transaction while not holding qgroup_ioctl_lock, to avoid
@@ -4622,6 +4628,10 @@ int btrfs_record_simple_quota_delta(struct btrfs_fs_info *fs_info,
 	if (!is_fstree(root))
 		return 0;
 
+	/* If the extent predates enabling quotas, don't count it. */
+	if (delta->generation < fs_info->quota_enable_gen)
+		return 0;
+
 	spin_lock(&fs_info->qgroup_lock);
 	qgroup = find_qgroup_rb(fs_info, root);
 	if (!qgroup) {
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index ce6fa8694ca7..ae1ce14b365c 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -241,6 +241,7 @@ struct btrfs_simple_quota_delta {
 	u64 rsv_bytes; /* The number of bytes reserved for this extent */
 	bool is_inc; /* Whether we are using or freeing the extent */
 	bool is_data; /* Whether the extent is data or metadata */
+	u64 generation; /* The generation the extent was created in */
 };
 
 static inline u64 btrfs_qgroup_subvolid(u64 qgroupid)
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index eacb26caf3c6..1120ce3dae42 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -1242,6 +1242,13 @@ struct btrfs_qgroup_status_item {
 	 * of the scan. It contains a logical address
 	 */
 	__le64 rescan;
+
+	/*
+	 * the generation when quotas are enabled. Used by simple quotas to
+	 * avoid decrementing when freeing an extent that was written before
+	 * enable.
+	 */
+	__le64 enable_gen;
 } __attribute__ ((__packed__));
 
 struct btrfs_qgroup_info_item {
-- 
2.41.0



* [PATCH v5 16/18] btrfs: track metadata relocation cow with simple quota
  2023-07-27 22:12 [PATCH v5 00/18] btrfs: simple quotas Boris Burkov
                   ` (14 preceding siblings ...)
  2023-07-27 22:13 ` [PATCH v5 15/18] btrfs: check generation when recording simple quota delta Boris Burkov
@ 2023-07-27 22:13 ` Boris Burkov
  2023-09-07 12:27   ` David Sterba
  2023-07-27 22:13 ` [PATCH v5 17/18] btrfs: track data relocation " Boris Burkov
                   ` (2 subsequent siblings)
  18 siblings, 1 reply; 53+ messages in thread
From: Boris Burkov @ 2023-07-27 22:13 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Relocation cows metadata blocks in two cases for the reloc root:
- copying the subvol root item when creating the reloc root
- copying a btree node when there is a cow during relocation

In both cases, the resulting btree node hits an abnormal code path with
respect to the owner field in its btrfs_header. It first creates the
root item for the new objectid, which populates the reloc root id, and
it is at this point that delayed refs are created.

Later, it fully copies the old node into the new node (including the
original owner field) which overwrites it. This results in a simple
quotas mismatch where we run the delayed ref for the reloc root which
has no simple quota effect (reloc root is not an fstree) but when we
ultimately delete the node, the owner is the real original fstree and we
do free the space.

To work around this without tampering with the behavior of relocation,
add a parameter to btrfs_alloc_tree_block that lets the relocation code
path specify a different owning root than the "operating" root (in this
case, owning root is the real root and the operating root is the reloc
root). These can naturally be plumbed into delayed refs that have the
same concept.

Note that this is a double count in some sense, but a relatively natural
one, as there are really two extents, and the old one will be deleted
soon. This is consistent with how data relocation extents are accounted
by simple quotas.
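
The choice between the owning root and the operating root can be
sketched as follows. The helper and constant names are hypothetical,
the constant is assumed to mirror BTRFS_TREE_RELOC_OBJECTID (-8ULL),
and the assert is an invariant of this toy model only.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror BTRFS_TREE_RELOC_OBJECTID (-8ULL). */
#define RS_TREE_RELOC_OBJECTID ((uint64_t)-8)

/*
 * Pick the root that the new tree block should be accounted to.
 * For the reloc root, use the root the source block belonged to;
 * otherwise the owner recorded in the new block's header is correct.
 */
static uint64_t rs_owning_root(uint64_t operating_root,
			       uint64_t header_owner,
			       uint64_t reloc_src_root)
{
	if (operating_root == RS_TREE_RELOC_OBJECTID) {
		/* Toy invariant: relocation callers pass the source root. */
		assert(reloc_src_root != 0);
		return reloc_src_root;
	}
	return header_owner;
}
```

For a block cowed on behalf of subvolume 256 during relocation, the
model charges 256 rather than the reloc root, so the later free of that
block is matched by an earlier allocation delta.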

Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.c       | 22 ++++++++++++++--------
 fs/btrfs/disk-io.c     |  4 ++--
 fs/btrfs/extent-tree.c |  8 ++++++--
 fs/btrfs/extent-tree.h |  3 ++-
 fs/btrfs/ioctl.c       |  2 +-
 5 files changed, 25 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index a4cb4b642987..cb0d4535de37 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -316,6 +316,7 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans,
 	int ret = 0;
 	int level;
 	struct btrfs_disk_key disk_key;
+	u64 reloc_src_root = 0;
 
 	WARN_ON(test_bit(BTRFS_ROOT_SHAREABLE, &root->state) &&
 		trans->transid != fs_info->running_transaction->transid);
@@ -328,9 +329,11 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans,
 	else
 		btrfs_node_key(buf, &disk_key, 0);
 
+	if (new_root_objectid == BTRFS_TREE_RELOC_OBJECTID)
+		reloc_src_root = btrfs_header_owner(buf);
 	cow = btrfs_alloc_tree_block(trans, root, 0, new_root_objectid,
 				     &disk_key, level, buf->start, 0,
-				     BTRFS_NESTING_NEW_ROOT);
+				     BTRFS_NESTING_NEW_ROOT, reloc_src_root);
 	if (IS_ERR(cow))
 		return PTR_ERR(cow);
 
@@ -522,6 +525,7 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
 	int last_ref = 0;
 	int unlock_orig = 0;
 	u64 parent_start = 0;
+	u64 reloc_src_root = 0;
 
 	if (*cow_ret == buf)
 		unlock_orig = 1;
@@ -540,12 +544,14 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
 	else
 		btrfs_node_key(buf, &disk_key, 0);
 
-	if ((root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) && parent)
-		parent_start = parent->start;
-
+	if (root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) {
+		if (parent)
+			parent_start = parent->start;
+		reloc_src_root = btrfs_header_owner(buf);
+	}
 	cow = btrfs_alloc_tree_block(trans, root, parent_start,
 				     root->root_key.objectid, &disk_key, level,
-				     search_start, empty_size, nest);
+				     search_start, empty_size, nest, reloc_src_root);
 	if (IS_ERR(cow))
 		return PTR_ERR(cow);
 
@@ -2956,7 +2962,7 @@ static noinline int insert_new_root(struct btrfs_trans_handle *trans,
 
 	c = btrfs_alloc_tree_block(trans, root, 0, root->root_key.objectid,
 				   &lower_key, level, root->node->start, 0,
-				   BTRFS_NESTING_NEW_ROOT);
+				   BTRFS_NESTING_NEW_ROOT, 0);
 	if (IS_ERR(c))
 		return PTR_ERR(c);
 
@@ -3100,7 +3106,7 @@ static noinline int split_node(struct btrfs_trans_handle *trans,
 
 	split = btrfs_alloc_tree_block(trans, root, 0, root->root_key.objectid,
 				       &disk_key, level, c->start, 0,
-				       BTRFS_NESTING_SPLIT);
+				       BTRFS_NESTING_SPLIT, 0);
 	if (IS_ERR(split))
 		return PTR_ERR(split);
 
@@ -3853,7 +3859,7 @@ static noinline int split_leaf(struct btrfs_trans_handle *trans,
 	right = btrfs_alloc_tree_block(trans, root, 0, root->root_key.objectid,
 				       &disk_key, 0, l->start, 0,
 				       num_doubles ? BTRFS_NESTING_NEW_ROOT :
-				       BTRFS_NESTING_SPLIT);
+				       BTRFS_NESTING_SPLIT, 0);
 	if (IS_ERR(right))
 		return PTR_ERR(right);
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b4495d4c1533..e2b0e11800fc 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -862,7 +862,7 @@ struct btrfs_root *btrfs_create_tree(struct btrfs_trans_handle *trans,
 	root->root_key.offset = 0;
 
 	leaf = btrfs_alloc_tree_block(trans, root, 0, objectid, NULL, 0, 0, 0,
-				      BTRFS_NESTING_NORMAL);
+				      BTRFS_NESTING_NORMAL, 0);
 	if (IS_ERR(leaf)) {
 		ret = PTR_ERR(leaf);
 		leaf = NULL;
@@ -939,7 +939,7 @@ int btrfs_alloc_log_tree_node(struct btrfs_trans_handle *trans,
 	 */
 
 	leaf = btrfs_alloc_tree_block(trans, root, 0, BTRFS_TREE_LOG_OBJECTID,
-			NULL, 0, 0, 0, BTRFS_NESTING_NORMAL);
+			NULL, 0, 0, 0, BTRFS_NESTING_NORMAL, 0);
 	if (IS_ERR(leaf))
 		return PTR_ERR(leaf);
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 395ab46e520b..50db75529a83 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4989,7 +4989,8 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
 					     const struct btrfs_disk_key *key,
 					     int level, u64 hint,
 					     u64 empty_size,
-					     enum btrfs_lock_nesting nest)
+					     enum btrfs_lock_nesting nest,
+					     u64 reloc_src_root)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_key ins;
@@ -5001,6 +5002,7 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
 	int ret;
 	u32 blocksize = fs_info->nodesize;
 	bool skinny_metadata = btrfs_fs_incompat(fs_info, SKINNY_METADATA);
+	u64 owning_root;
 
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 	if (btrfs_is_testing(fs_info)) {
@@ -5027,11 +5029,13 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
 		ret = PTR_ERR(buf);
 		goto out_free_reserved;
 	}
+	owning_root = btrfs_header_owner(buf);
 
 	if (root_objectid == BTRFS_TREE_RELOC_OBJECTID) {
 		if (parent == 0)
 			parent = ins.objectid;
 		flags |= BTRFS_BLOCK_FLAG_FULL_BACKREF;
+		owning_root = reloc_src_root;
 	} else
 		BUG_ON(parent > 0);
 
@@ -5051,7 +5055,7 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
 		extent_op->level = level;
 
 		btrfs_init_generic_ref(&generic_ref, BTRFS_ADD_DELAYED_EXTENT,
-				       ins.objectid, ins.offset, parent, btrfs_header_owner(buf));
+				       ins.objectid, ins.offset, parent, owning_root);
 		btrfs_init_tree_ref(&generic_ref, level, root_objectid,
 				    root->root_key.objectid, false);
 		btrfs_ref_tree_mod(fs_info, &generic_ref);
diff --git a/fs/btrfs/extent-tree.h b/fs/btrfs/extent-tree.h
index 7c27652880a2..99b11e278ae4 100644
--- a/fs/btrfs/extent-tree.h
+++ b/fs/btrfs/extent-tree.h
@@ -118,7 +118,8 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
 					     const struct btrfs_disk_key *key,
 					     int level, u64 hint,
 					     u64 empty_size,
-					     enum btrfs_lock_nesting nest);
+					     enum btrfs_lock_nesting nest,
+					     u64 reloc_src_root);
 void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 			   u64 root_id,
 			   struct extent_buffer *buf,
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index c9b069077fd0..f3807def6596 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -657,7 +657,7 @@ static noinline int create_subvol(struct mnt_idmap *idmap,
 		goto out;
 
 	leaf = btrfs_alloc_tree_block(trans, root, 0, objectid, NULL, 0, 0, 0,
-				      BTRFS_NESTING_NORMAL);
+				      BTRFS_NESTING_NORMAL, 0);
 	if (IS_ERR(leaf)) {
 		ret = PTR_ERR(leaf);
 		goto out;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v5 17/18] btrfs: track data relocation with simple quota
  2023-07-27 22:12 [PATCH v5 00/18] btrfs: simple quotas Boris Burkov
                   ` (15 preceding siblings ...)
  2023-07-27 22:13 ` [PATCH v5 16/18] btrfs: track metadata relocation cow with simple quota Boris Burkov
@ 2023-07-27 22:13 ` Boris Burkov
  2023-08-21 18:16   ` Josef Bacik
  2023-07-27 22:13 ` [PATCH v5 18/18] btrfs: only set QUOTA_ENABLED when done reading qgroups Boris Burkov
  2023-09-07 10:51 ` [PATCH v5 00/18] btrfs: simple quotas David Sterba
  18 siblings, 1 reply; 53+ messages in thread
From: Boris Burkov @ 2023-07-27 22:13 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Relocation data allocations are quite tricky for simple quotas. The
basic data relocation sequence is (ignoring details that aren't relevant
to this fix):
- create a fake relocation data fs root
- create a fake relocation inode in that root
- foreach data extent:
  - preallocate a data extent on behalf of the fake inode
  - copy over the data
- foreach extent:
  - swap the refs so that the original file extent now refers to the new
    extent item
- drop the fake root, dropping its refs on the old extents, which lets
  us delete them.

Done naively, this results in storing an extent item in the extent tree
whose owner_ref points at the relocation data root and a no-op squota
recording, since the reloc root is not a legit fstree. So far, that's
OK. The problem comes when you do the swap, and leave an extent item
owned by this bogus root as the real permanent extents of the file. If
the file then drops that ref, we free it and no-op account that against
the fake relocation root. Essentially, this means that relocation is
simple quota "extent laundering", since we re-own the extents into a
fake root.

Simple quotas very intentionally doesn't have a mechanism for
transferring ownership of extents, as that is exactly the complicated
thing we are trying to avoid with the new design. Further, it cannot be
correctly done in this case, since at the time you create the new
"real" refs, there is no way to know which was the original owner before
relocation unless we track it.

Therefore, it makes more sense to trick the preallocation into handling
relocation as a special case and noting the proper owner ref from the
beginning. That way, we never write out an extent item without the
correct owner ref that it will eventually have.

This could be done by wiring a special root parameter all the way
through the allocation code path, but to avoid that special case
touching all the code, take advantage of the serial nature of relocation
to store the src root on the relocation root object. Then when we finish
the prealloc, if it happens to be this case, prepare the delayed ref
appropriately.

We must also add logic to handle relocating adjacent extents with
different owning roots. Those cannot be preallocated together in a
cluster as it would lose the separate ownership information.
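
As a rough illustration of the cluster-splitting rule described above, the
following is a minimal userspace sketch, not the kernel code. The names
sketch_cluster, cluster_add and SKETCH_MAX_EXTENTS are hypothetical, and the
"flush" is modeled by copying the pending cluster out rather than relocating
anything. The pending cluster is flushed whenever the next extent is either
not adjacent or has a different owning root.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical limit for this sketch only. */
#define SKETCH_MAX_EXTENTS 128

/* Hypothetical, simplified stand-in for struct file_extent_cluster. */
struct sketch_cluster {
	uint64_t start;
	uint64_t end;
	uint64_t owning_root;
	unsigned int nr;
};

/*
 * Add the extent [bytenr, bytenr + len) owned by src_root to the cluster.
 * If the extent is not adjacent to the cluster, or has a different owning
 * root, the pending cluster is copied to *flushed and a new one is started.
 * Returns 1 if a flush happened, 0 otherwise.
 */
static int cluster_add(struct sketch_cluster *c, uint64_t bytenr, uint64_t len,
		       uint64_t src_root, struct sketch_cluster *flushed)
{
	int did_flush = 0;

	if (c->nr > 0 &&
	    (bytenr != c->end + 1 || src_root != c->owning_root)) {
		*flushed = *c;
		c->nr = 0;
		did_flush = 1;
	}

	if (c->nr == 0) {
		/* A fresh cluster remembers the root that started it. */
		c->start = bytenr;
		c->owning_root = src_root;
	}

	assert(c->nr < SKETCH_MAX_EXTENTS);
	c->end = bytenr + len - 1;
	c->nr++;
	return did_flush;
}
```

With this model, two adjacent extents from root 256 followed by an adjacent
extent from root 257 produce one flushed cluster owned by 256 and a new
pending cluster owned by 257, so no preallocation ever spans two owners.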

This is obviously a smelly bit of code, but I think it is the best
solution to the problem, given the relocation implementation.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/extent-tree.c | 13 ++++++-----
 fs/btrfs/relocation.c  | 49 +++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 57 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f2d2b313bde5..577186994188 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -333,6 +333,7 @@ struct btrfs_root {
 #ifdef CONFIG_BTRFS_DEBUG
 	struct list_head leak_list;
 #endif
+	u64 relocation_src_root;
 };
 
 static inline bool btrfs_root_readonly(const struct btrfs_root *root)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 50db75529a83..eb132200833d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -58,7 +58,7 @@ static void __run_delayed_extent_op(struct btrfs_delayed_extent_op *extent_op,
 static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
 				      u64 parent, u64 root_objectid,
 				      u64 flags, u64 owner, u64 offset,
-				      struct btrfs_key *ins, int ref_mod);
+				      struct btrfs_key *ins, int ref_mod, u64 oref_root);
 static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
 				     struct btrfs_delayed_ref_node *node,
 				     struct btrfs_delayed_extent_op *extent_op);
@@ -1521,7 +1521,7 @@ static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
 		ret = alloc_reserved_file_extent(trans, parent, ref_root,
 						 flags, ref->objectid,
 						 ref->offset, &ins,
-						 node->ref_mod);
+						 node->ref_mod, href->owning_root);
 		if (!ret)
 			ret = btrfs_record_simple_quota_delta(trans->fs_info, &delta);
 	} else if (node->action == BTRFS_ADD_DELAYED_REF) {
@@ -4669,7 +4669,7 @@ static int alloc_reserved_extent(struct btrfs_trans_handle *trans, u64 bytenr,
 static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
 				      u64 parent, u64 root_objectid,
 				      u64 flags, u64 owner, u64 offset,
-				      struct btrfs_key *ins, int ref_mod)
+				      struct btrfs_key *ins, int ref_mod, u64 oref_root)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct btrfs_root *extent_root;
@@ -4717,7 +4717,7 @@ static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
 	if (simple_quota) {
 		btrfs_set_extent_inline_ref_type(leaf, iref, BTRFS_EXTENT_OWNER_REF_KEY);
 		oref = (struct btrfs_extent_owner_ref *)(&iref->offset);
-		btrfs_set_extent_owner_ref_root_id(leaf, oref, root_objectid);
+		btrfs_set_extent_owner_ref_root_id(leaf, oref, oref_root);
 		iref = (struct btrfs_extent_inline_ref *)(oref + 1);
 	}
 	btrfs_set_extent_inline_ref_type(leaf, iref, type);
@@ -4828,6 +4828,9 @@ int btrfs_alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
 
 	BUG_ON(root_objectid == BTRFS_TREE_LOG_OBJECTID);
 
+	if (btrfs_is_data_reloc_root(root) && is_fstree(root->relocation_src_root))
+		owning_root = root->relocation_src_root;
+
 	btrfs_init_generic_ref(&generic_ref, BTRFS_ADD_DELAYED_EXTENT,
 			       ins->objectid, ins->offset, 0, owning_root);
 	btrfs_init_data_ref(&generic_ref, root_objectid, owner,
@@ -4883,7 +4886,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
 	spin_unlock(&space_info->lock);
 
 	ret = alloc_reserved_file_extent(trans, 0, root_objectid, 0, owner,
-					 offset, ins, 1);
+					 offset, ins, 1, root_objectid);
 	if (ret)
 		btrfs_pin_extent(trans, ins->objectid, ins->offset, 1);
 	ret = btrfs_record_simple_quota_delta(fs_info, &delta);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 3161a48d5970..f8c4a549db3a 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -122,6 +122,7 @@ struct file_extent_cluster {
 	u64 end;
 	u64 boundary[MAX_EXTENTS];
 	unsigned int nr;
+	u64 owning_root;
 };
 
 struct reloc_control {
@@ -3130,6 +3131,7 @@ int relocate_data_extent(struct inode *inode, struct btrfs_key *extent_key,
 			 struct file_extent_cluster *cluster)
 {
 	int ret;
+	struct btrfs_root *root = BTRFS_I(inode)->root;
 
 	if (cluster->nr > 0 && extent_key->objectid != cluster->end + 1) {
 		ret = relocate_file_extent_cluster(inode, cluster);
@@ -3138,8 +3140,38 @@ int relocate_data_extent(struct inode *inode, struct btrfs_key *extent_key,
 		cluster->nr = 0;
 	}
 
-	if (!cluster->nr)
+	/*
+	 * Under simple quotas, we set root->relocation_src_root when we find
+	 * the extent. If adjacent extents have different owners, we can't merge
+	 * them while relocating. Handle this by storing the owning root that
+	 * started a cluster and if we see an extent from a different root break
+	 * cluster formation (just like the above case of non-adjacent extents).
+	 *
+	 * Absent simple quotas, relocation_src_root is always 0, so we should
+	 * never see a mismatch, and it should have no effect on relocation
+	 * clusters.
+	 */
+	if (cluster->nr > 0 && cluster->owning_root != root->relocation_src_root) {
+		u64 tmp = root->relocation_src_root;
+
+		/*
+		 * root->relocation_src_root is the state that actually
+		 * affects the preallocation we do here, so set it to the
+		 * root owning the cluster we need to relocate.
+		 */
+		root->relocation_src_root = cluster->owning_root;
+		ret = relocate_file_extent_cluster(inode, cluster);
+		if (ret)
+			return ret;
+		cluster->nr = 0;
+		/* And reset it back for the current extent's owning root */
+		root->relocation_src_root = tmp;
+	}
+
+	if (!cluster->nr) {
 		cluster->start = extent_key->objectid;
+		cluster->owning_root = root->relocation_src_root;
+	}
 	else
 		BUG_ON(cluster->nr >= MAX_EXTENTS);
 	cluster->end = extent_key->objectid + extent_key->offset - 1;
@@ -3668,6 +3700,21 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
 				    struct btrfs_extent_item);
 		flags = btrfs_extent_flags(path->nodes[0], ei);
 
+		/*
+		 * If we are relocating a simple quota owned extent item, we need
+		 * to note the owner on the reloc data root so that when we
+		 * allocate the replacement item, we can attribute it to the
+		 * correct eventual owner (rather than the reloc data root)
+		 */
+		if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE) {
+			struct btrfs_root *root = BTRFS_I(rc->data_inode)->root;
+			u64 owning_root_id = btrfs_get_extent_owner_root(fs_info,
+									 path->nodes[0],
+									 path->slots[0]);
+
+			root->relocation_src_root = owning_root_id;
+		}
+
 		if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
 			ret = add_tree_block(rc, &key, path, &blocks);
 		} else if (rc->stage == UPDATE_DATA_PTRS &&
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v5 18/18] btrfs: only set QUOTA_ENABLED when done reading qgroups
  2023-07-27 22:12 [PATCH v5 00/18] btrfs: simple quotas Boris Burkov
                   ` (16 preceding siblings ...)
  2023-07-27 22:13 ` [PATCH v5 17/18] btrfs: track data relocation " Boris Burkov
@ 2023-07-27 22:13 ` Boris Burkov
  2023-08-21 18:16   ` Josef Bacik
  2023-09-07 10:51 ` [PATCH v5 00/18] btrfs: simple quotas David Sterba
  18 siblings, 1 reply; 53+ messages in thread
From: Boris Burkov @ 2023-07-27 22:13 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

In open_ctree, we set BTRFS_FS_QUOTA_ENABLED as soon as we see a
quota_root, as opposed to after we are done setting up the qgroup
structures. In the quota_enable path, we wait until after the structures
are set up. Likewise, in disable, we clear the bit before tearing down
the structures. I feel that this organization is less surprising for the
open_ctree path.

I don't believe this fixes any actual bug, but it avoids potential
confusion when using btrfs_qgroup_mode in an intermediate state where we
are enabled but haven't yet set up the qgroup status flags. It also
avoids any risk of calling a qgroup function and attempting to use the
qgroup rbtrees before they exist/are set up.

This all occurs before we do rw setup, so I believe it should be mostly
a no-op.
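
The ordering argued for above can be sketched in a small userspace model.
This is only an illustration under assumptions: struct fs_state and
read_qgroup_config are hypothetical names, the load_result parameter stands
in for whatever reading the qgroup items returns, and a heap allocation
stands in for the qgroup structures. The point is that the "enabled" flag is
set last, and only when setup succeeded and the on-disk status says quotas
are on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified stand-in for the relevant fs_info state. */
struct fs_state {
	bool have_quota_root;	/* a quota root was found at mount */
	bool status_on;		/* models BTRFS_QGROUP_STATUS_FLAG_ON */
	int *structures;	/* stand-in for the qgroup ulist/rbtrees */
	bool quota_enabled;	/* models BTRFS_FS_QUOTA_ENABLED */
};

/*
 * Set up the quota structures first; publish the enabled flag only after
 * everything succeeded. On failure, tear down and leave the flag clear.
 */
static int read_qgroup_config(struct fs_state *fs, int load_result)
{
	assert(fs != NULL);

	if (!fs->have_quota_root)
		return 0;

	fs->structures = calloc(1, sizeof(*fs->structures));
	if (!fs->structures)
		return -1;

	if (load_result < 0) {
		free(fs->structures);
		fs->structures = NULL;
		return load_result;
	}

	if (fs->status_on)
		fs->quota_enabled = true;
	return 0;
}
```

In this model there is no window in which quota_enabled is true while
structures is still NULL, which is the property the patch is after.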

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/disk-io.c |  1 -
 fs/btrfs/qgroup.c  | 15 +++++++--------
 2 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e2b0e11800fc..874685c84df2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2254,7 +2254,6 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
 	root = btrfs_read_tree_root(tree_root, &location);
 	if (!IS_ERR(root)) {
 		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
-		set_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);
 		fs_info->quota_root = root;
 	}
 
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index a8a603242431..1f915d70b99d 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -402,7 +402,7 @@ int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info)
 	u64 rescan_progress = 0;
 	bool simple;
 
-	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_DISABLED)
+	if (!fs_info->quota_root)
 		return 0;
 
 	fs_info->qgroup_ulist = ulist_alloc(GFP_KERNEL);
@@ -565,13 +565,12 @@ int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info)
 out:
 	btrfs_free_path(path);
 	fs_info->qgroup_flags |= flags;
-	if (!(fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_ON))
-		clear_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);
-	else if (fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_RESCAN &&
-		 ret >= 0)
-		ret = qgroup_rescan_init(fs_info, rescan_progress, 0);
-
-	if (ret < 0) {
+	if (ret >= 0) {
+		if (fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_ON)
+			set_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);
+		if (fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_RESCAN)
+			ret = qgroup_rescan_init(fs_info, rescan_progress, 0);
+	} else {
 		ulist_free(fs_info->qgroup_ulist);
 		fs_info->qgroup_ulist = NULL;
 		fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_RESCAN;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 02/18] btrfs: add new quota mode for simple quotas
  2023-07-27 22:12 ` [PATCH v5 02/18] btrfs: add new quota mode for simple quotas Boris Burkov
@ 2023-08-21 18:00   ` Josef Bacik
  2023-09-07 11:19   ` David Sterba
  1 sibling, 0 replies; 53+ messages in thread
From: Josef Bacik @ 2023-08-21 18:00 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:12:49PM -0700, Boris Burkov wrote:
> Add a new quota mode called "simple quotas". It can be enabled by the
> existing quota enable ioctl via a new command, and sets an incompat
> bit, as the implementation of simple quotas will make backwards
> incompatible changes to the disk format of the extent tree.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/delayed-ref.c          |  4 +-
>  fs/btrfs/fs.h                   |  5 +-
>  fs/btrfs/ioctl.c                |  3 +-
>  fs/btrfs/qgroup.c               | 91 +++++++++++++++++++++++----------
>  fs/btrfs/qgroup.h               |  4 +-
>  fs/btrfs/root-tree.c            |  2 +-
>  fs/btrfs/transaction.c          |  4 +-
>  include/uapi/linux/btrfs.h      |  2 +
>  include/uapi/linux/btrfs_tree.h | 14 ++++-
>  9 files changed, 91 insertions(+), 38 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 6a13cf00218b..a9b938d3a531 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -898,7 +898,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
>  		return -ENOMEM;
>  	}
>  
> -	if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_DISABLED &&
>  	    !generic_ref->skip_qgroup) {
>  		record = kzalloc(sizeof(*record), GFP_NOFS);
>  		if (!record) {
> @@ -1002,7 +1002,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
>  		return -ENOMEM;
>  	}
>  
> -	if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_DISABLED &&
>  	    !generic_ref->skip_qgroup) {
>  		record = kzalloc(sizeof(*record), GFP_NOFS);
>  		if (!record) {
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index 203d2a267828..f76f450c2abf 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -218,7 +218,8 @@ enum {
>  	 BTRFS_FEATURE_INCOMPAT_NO_HOLES	|	\
>  	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID	|	\
>  	 BTRFS_FEATURE_INCOMPAT_RAID1C34	|	\
> -	 BTRFS_FEATURE_INCOMPAT_ZONED)
> +	 BTRFS_FEATURE_INCOMPAT_ZONED		|	\
> +	 BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA)
>  
>  #ifdef CONFIG_BTRFS_DEBUG
>  	/*
> @@ -233,7 +234,6 @@ enum {
>  
>  #define BTRFS_FEATURE_INCOMPAT_SUPP		\
>  	(BTRFS_FEATURE_INCOMPAT_SUPP_STABLE)
> -

Extraneous newline change.

>  #endif
>  
>  #define BTRFS_FEATURE_INCOMPAT_SAFE_SET			\
> @@ -790,7 +790,6 @@ struct btrfs_fs_info {
>  	struct lockdep_map btrfs_state_change_map[4];
>  	struct lockdep_map btrfs_trans_pending_ordered_map;
>  	struct lockdep_map btrfs_ordered_extent_map;
> -

Same here.  Once you fix it up you can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 03/18] btrfs: expose quota mode via sysfs
  2023-07-27 22:12 ` [PATCH v5 03/18] btrfs: expose quota mode via sysfs Boris Burkov
@ 2023-08-21 18:00   ` Josef Bacik
  2023-09-07 11:25   ` David Sterba
  1 sibling, 0 replies; 53+ messages in thread
From: Josef Bacik @ 2023-08-21 18:00 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:12:50PM -0700, Boris Burkov wrote:
> Add a new sysfs file
> /sys/fs/btrfs/<uuid>/qgroups/mode
> which prints out the mode qgroups is running in. The possible modes are
> disabled, qgroup, and squota
> 
> Signed-off-by: Boris Burkov <boris@bur.io>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 04/18] btrfs: add simple_quota incompat feature to sysfs
  2023-07-27 22:12 ` [PATCH v5 04/18] btrfs: add simple_quota incompat feature to sysfs Boris Burkov
@ 2023-08-21 18:01   ` Josef Bacik
  2023-09-07 11:28   ` David Sterba
  1 sibling, 0 replies; 53+ messages in thread
From: Josef Bacik @ 2023-08-21 18:01 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:12:51PM -0700, Boris Burkov wrote:
> Add an entry in the features directory for the new incompat flag
> 
> Signed-off-by: Boris Burkov <boris@bur.io>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 06/18] btrfs: create qgroup earlier in snapshot creation
  2023-07-27 22:12 ` [PATCH v5 06/18] btrfs: create qgroup earlier in snapshot creation Boris Burkov
@ 2023-08-21 18:02   ` Josef Bacik
  2023-09-07 11:41   ` David Sterba
  1 sibling, 0 replies; 53+ messages in thread
From: Josef Bacik @ 2023-08-21 18:02 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:12:53PM -0700, Boris Burkov wrote:
> Pull creating the qgroup earlier in the snapshot. This allows simple
> quotas qgroups to see all the metadata writes related to the snapshot
> being created and to be born with the root node accounted.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 07/18] btrfs: function for recording simple quota deltas
  2023-07-27 22:12 ` [PATCH v5 07/18] btrfs: function for recording simple quota deltas Boris Burkov
@ 2023-08-21 18:04   ` Josef Bacik
  2023-09-07 11:46   ` David Sterba
  1 sibling, 0 replies; 53+ messages in thread
From: Josef Bacik @ 2023-08-21 18:04 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:12:54PM -0700, Boris Burkov wrote:
> Rather than re-computing shared/exclusive ownership based on backrefs
> and walking roots for implicit backrefs, simple quotas does an increment
> when creating an extent and a decrement when deleting it. Add the API
> for the extent item code to use to track those events.
> 
> Also add a helper function to make collecting parent qgroups in a ulist
> easier for functions like this.
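
For readers following along, the increment/decrement model described in the
quoted text can be sketched as below. This is a hedged userspace
illustration, not the API added by the patch: sketch_delta, sketch_qgroup
and record_delta are hypothetical names, and each qgroup is assumed to have
at most one parent, whereas real qgroups can have several.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical delta: one extent created (is_inc) or deleted (!is_inc). */
struct sketch_delta {
	uint64_t num_bytes;
	bool is_inc;
};

/* Hypothetical qgroup with a single optional parent. */
struct sketch_qgroup {
	uint64_t id;
	uint64_t usage;
	struct sketch_qgroup *parent;
};

/*
 * Apply the delta to the owning qgroup and every ancestor. No backref
 * walk is needed: the caller already knows the owning qgroup.
 */
static void record_delta(struct sketch_qgroup *qg, const struct sketch_delta *d)
{
	for (; qg != NULL; qg = qg->parent) {
		if (d->is_inc) {
			qg->usage += d->num_bytes;
		} else {
			/* Usage must never underflow in this model. */
			assert(qg->usage >= d->num_bytes);
			qg->usage -= d->num_bytes;
		}
	}
}
```

Creating and then deleting a 4096-byte extent under a subvolume qgroup
raises and then lowers the usage of that qgroup and of its parent by the
same amount.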
> 
> Signed-off-by: Boris Burkov <boris@bur.io>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 09/18] btrfs: track owning root in btrfs_ref
  2023-07-27 22:12 ` [PATCH v5 09/18] btrfs: track owning root in btrfs_ref Boris Burkov
@ 2023-08-21 18:05   ` Josef Bacik
  0 siblings, 0 replies; 53+ messages in thread
From: Josef Bacik @ 2023-08-21 18:05 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:12:56PM -0700, Boris Burkov wrote:
> While data extents require us to store additional inline refs to track
> the original owner on free, this information is available implicitly for
> metadata. It is found in the owner field of the header of the tree
> block. Even if other trees refer to this block and the original ref goes
> away, we will not rewrite that header field, so it will reliably give the
> original owner.
> 
> In addition, there is a relocation case where a new data extent needs to
> have an owning root separate from the referring root wired through
> delayed refs.
> 
> To use it for recording simple quota deltas, we need to wire this root
> id through from when we create the delayed ref until we fully process
> it. Store it in the generic btrfs_ref struct of the delayed ref.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 10/18] btrfs: track original extent owner in head_ref
  2023-07-27 22:12 ` [PATCH v5 10/18] btrfs: track original extent owner in head_ref Boris Burkov
@ 2023-08-21 18:06   ` Josef Bacik
  2023-09-07 11:54   ` David Sterba
  1 sibling, 0 replies; 53+ messages in thread
From: Josef Bacik @ 2023-08-21 18:06 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:12:57PM -0700, Boris Burkov wrote:
> Simple quotas requires tracking the original creating root of any given
> extent. This gets complicated when multiple subvolumes create
> overlapping/contradictory refs in the same transaction, for example
> by modifying or deleting an extent while also snapshotting it.
> 
> To resolve this in a general way, take advantage of the fact that we are
> essentially already tracking this for handling releasing reservations.
> The head ref coalesces the various refs and uses must_insert_reserved to
> check if it needs to create an extent/free reservation. Store the ref
> that set must_insert_reserved as the owning ref on the head ref.
> 
> Note that this can result in writing an extent for the very first time
> with an owner different from its only ref, but it will look the same as
> if you first created it with the original owning ref, then added the
> other ref, then removed the owning ref.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 11/18] btrfs: new inline ref storing owning subvol of data extents
  2023-07-27 22:12 ` [PATCH v5 11/18] btrfs: new inline ref storing owning subvol of data extents Boris Burkov
@ 2023-08-21 18:07   ` Josef Bacik
  2023-09-07 12:06   ` David Sterba
  1 sibling, 0 replies; 53+ messages in thread
From: Josef Bacik @ 2023-08-21 18:07 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:12:58PM -0700, Boris Burkov wrote:
> In order to implement simple quota groups, we need to be able to
> associate a data extent with the subvolume that created it. Once you
> account for reflink, this information cannot be recovered without
> explicitly storing it. Options for storing it are:
> - a new key/item
> - a new extent inline ref item
> 
> The former is backwards compatible, but wastes space; the latter is
> incompat, but is efficient in space and reuses the existing inline ref
> machinery, while only abusing it a tiny amount -- specifically, the new
> item is not a ref, per se.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 13/18] btrfs: record simple quota deltas
  2023-07-27 22:13 ` [PATCH v5 13/18] btrfs: record simple quota deltas Boris Burkov
@ 2023-08-21 18:08   ` Josef Bacik
  2023-09-07 12:12   ` David Sterba
  1 sibling, 0 replies; 53+ messages in thread
From: Josef Bacik @ 2023-08-21 18:08 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:13:00PM -0700, Boris Burkov wrote:
> At the moment that we run delayed refs, we make the final ref-count
> based decision on creating/removing extent (and metadata) items.
> Therefore, it is exactly the spot to hook up simple quotas.
> 
> There are a few important subtleties to the fields we must collect to
> accurately track simple quotas, particularly when removing an extent.
> When removing a data extent, the ref could be in any tree (due to
> reflink, for example) and so we need to recover the owning root id from
> the owner ref item. When removing a metadata extent, we know the owning
> root from the owner field in the header when we create the delayed ref,
> so we can recover it from there.
> 
> We must also be careful to handle reservations properly to not leak
> reserved space. The happy path is freeing the reservation when the
> simple quota delta runs on a data extent. If that doesn't happen, due to
> refs canceling out or some error, the ref head already has the
> must_insert_reserved machinery to handle this, so we piggy back on that
> and use it to clean up the reserved data.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 14/18] btrfs: simple quota auto hierarchy for nested subvols
  2023-07-27 22:13 ` [PATCH v5 14/18] btrfs: simple quota auto hierarchy for nested subvols Boris Burkov
@ 2023-08-21 18:10   ` Josef Bacik
  2023-09-07 12:16   ` David Sterba
  1 sibling, 0 replies; 53+ messages in thread
From: Josef Bacik @ 2023-08-21 18:10 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:13:01PM -0700, Boris Burkov wrote:
> Consider the following sequence:
> - enable quotas
> - create subvol S id 256 at dir outer/
> - create a qgroup 1/100
> - add 0/256 (S's auto qgroup) to 1/100
> - create subvol T id 257 at dir outer/inner/
> 
> With full qgroups, there is no relationship between 0/257 and either of
> 0/256 or 1/100. There is an inherit feature that the creator of inner/
> can use to specify it ought to be in 1/100.
> 
> Simple quotas are targeted at container isolation, where such automatic
> inheritance for not necessarily trusted/controlled nested subvol
> creation would be quite helpful. Therefore, add a new default behavior
> for simple quotas: when you create a nested subvol, automatically
> inherit as parents any parents of the qgroup of the subvol the new inode
> is going in.
> 
> In our example, 0/257 would also be under 1/100, allowing easy control
> of a total quota over an arbitrary hierarchy of subvolumes.
> 
> I think this _might_ be a generally useful behavior, so it could be
> interesting to put it behind a new inheritance flag that simple quotas
> always use while traditional quotas let the user specify, but this is a
> minimally intrusive change to start.
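
The automatic inheritance described in the quoted text amounts to copying
the parent qgroup ids of the containing subvolume into the inherit request
for the new subvolume. A minimal userspace sketch follows, under stated
assumptions: sketch_inherit and auto_inherit are hypothetical names, and
the parents are passed in as a plain array instead of being looked up in
the qgroup rbtree. The qgroupid helper relies on the usual encoding of
"level/id" with the level stored in the top 16 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct btrfs_qgroup_inherit. */
struct sketch_inherit {
	uint64_t num_qgroups;
	uint64_t qgroups[];	/* parent qgroup ids follow the header */
};

/* Encode "level/id" (for example 1/100) as a 64-bit qgroup id. */
static uint64_t qgroupid(uint64_t level, uint64_t id)
{
	return (level << 48) | id;
}

/*
 * Build an inherit request naming every parent of the containing
 * subvolume's qgroup. Returns NULL if there are no parents or on
 * allocation failure; the caller frees the result.
 */
static struct sketch_inherit *auto_inherit(const uint64_t *parents, size_t n)
{
	struct sketch_inherit *inh;
	size_t i;

	assert(parents != NULL || n == 0);
	if (n == 0)
		return NULL;

	inh = calloc(1, sizeof(*inh) + n * sizeof(uint64_t));
	if (!inh)
		return NULL;

	inh->num_qgroups = n;
	/* Each parent goes in its own slot, so the index must advance. */
	for (i = 0; i < n; i++)
		inh->qgroups[i] = parents[i];
	return inh;
}
```

In the example from the quoted text, S (0/256) has the single parent 1/100,
so the request built for T (0/257) would list exactly that one id.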
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/ioctl.c       |  2 +-
>  fs/btrfs/qgroup.c      | 44 +++++++++++++++++++++++++++++++++++++++---
>  fs/btrfs/qgroup.h      |  6 +++---
>  fs/btrfs/transaction.c | 13 +++++++++----
>  4 files changed, 54 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 9b61bc62e439..c9b069077fd0 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -652,7 +652,7 @@ static noinline int create_subvol(struct mnt_idmap *idmap,
>  	/* Tree log can't currently deal with an inode which is a new root. */
>  	btrfs_set_log_full_commit(trans);
>  
> -	ret = btrfs_qgroup_inherit(trans, 0, objectid, inherit);
> +	ret = btrfs_qgroup_inherit(trans, 0, objectid, root->root_key.objectid, inherit);
>  	if (ret)
>  		goto out;
>  
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index dedc532669f4..58e9ed0deedd 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -1550,8 +1550,7 @@ static int quick_update_accounting(struct btrfs_fs_info *fs_info,
>  	return ret;
>  }
>  
> -int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src,
> -			      u64 dst)
> +int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src, u64 dst)
>  {
>  	struct btrfs_fs_info *fs_info = trans->fs_info;
>  	struct btrfs_qgroup *parent;
> @@ -2991,6 +2990,40 @@ int btrfs_run_qgroups(struct btrfs_trans_handle *trans)
>  	return ret;
>  }
>  
> +static int qgroup_auto_inherit(struct btrfs_fs_info *fs_info,
> +			       u64 inode_rootid,
> +			       struct btrfs_qgroup_inherit **inherit)
> +{
> +	int i = 0;
> +	u64 num_qgroups = 0;
> +	struct btrfs_qgroup *inode_qg;
> +	struct btrfs_qgroup_list *qg_list;
> +
> +	if (*inherit)
> +		return -EEXIST;
> +
> +	inode_qg = find_qgroup_rb(fs_info, inode_rootid);
> +	if (!inode_qg)
> +		return -ENOENT;
> +
> +	num_qgroups = list_count_nodes(&inode_qg->groups);
> +
> +	if (!num_qgroups)
> +		return 0;
> +
> +	*inherit = kzalloc(sizeof(**inherit) + num_qgroups * sizeof(u64), GFP_NOFS);
> +	if (!*inherit)
> +		return -ENOMEM;
> +	(*inherit)->num_qgroups = num_qgroups;
> +
> +	list_for_each_entry(qg_list, &inode_qg->groups, next_group) {
> +		u64 qg_id = qg_list->group->qgroupid;
> +		*((u64 *)((*inherit)+1) + i++) = qg_id;
> +	}
> +
> +	return 0;
> +}
> +
>  /*
>   * Copy the accounting information between qgroups. This is necessary
>   * when a snapshot or a subvolume is created. Throwing an error will
> @@ -2998,7 +3031,8 @@ int btrfs_run_qgroups(struct btrfs_trans_handle *trans)
>   * when a readonly fs is a reasonable outcome.
>   */
>  int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
> -			 u64 objectid, struct btrfs_qgroup_inherit *inherit)
> +			 u64 objectid, u64 inode_rootid,
> +			 struct btrfs_qgroup_inherit *inherit)
>  {
>  	int ret = 0;
>  	int i;
> @@ -3040,6 +3074,9 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
>  		goto out;
>  	}
>  
> +	if (!inherit && btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
> +		qgroup_auto_inherit(fs_info, inode_rootid, &inherit);
> +
>  	if (inherit) {
>  		i_qgroups = (u64 *)(inherit + 1);
>  		nums = inherit->num_qgroups + 2 * inherit->num_ref_copies +
> @@ -3066,6 +3103,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
>  	if (ret)
>  		goto out;
>  
> +

Extraneous whitespace change.  Once fixed you can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Re: [PATCH v5 15/18] btrfs: check generation when recording simple quota delta
  2023-07-27 22:13 ` [PATCH v5 15/18] btrfs: check generation when recording simple quota delta Boris Burkov
@ 2023-08-21 18:11   ` Josef Bacik
  2023-09-07 12:24   ` David Sterba
  1 sibling, 0 replies; 53+ messages in thread
From: Josef Bacik @ 2023-08-21 18:11 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:13:02PM -0700, Boris Burkov wrote:
> Simple quotas count extents only from the moment the feature is enabled.
> Therefore, if we do something like:
> 1. create subvol S
> 2. write F in S
> 3. enable quotas
> 4. remove F
> 5. write G in S
> 
> then after 3. and 4. we would expect the simple quota usage of S to be 0
> (putting aside some metadata extents that might be written) and after
> 5., it should be the size of G plus metadata. Therefore, we need to be
> able to determine whether a particular quota delta we are processing
> predates simple quota enablement.
> 
> To do this, store the transaction id when quotas were enabled. In
> fs_info for immediate use and in the quota status item to make it
> recoverable on mount. When we see a delta, check if the generation of
> the extent item is less than that of quota enablement. If so, we should
> ignore the delta from this extent.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
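
The check described above can be sketched in userspace roughly as follows.
This is only an illustration under assumptions: the function names and the
plain integer usage counter are hypothetical and do not come from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when the extent was created before simple
 * quotas were enabled, so its delta must be ignored. */
static bool squota_delta_predates_enable(uint64_t extent_gen,
					 uint64_t enable_gen)
{
	return extent_gen < enable_gen;
}

/* Apply a signed byte delta to a usage counter, unless the extent
 * predates enablement. Adding the delta converted to unsigned relies on
 * modular arithmetic, so a negative delta subtracts. */
static uint64_t squota_apply_delta(uint64_t usage, int64_t delta,
				   uint64_t extent_gen, uint64_t enable_gen)
{
	if (squota_delta_predates_enable(extent_gen, enable_gen))
		return usage;
	return usage + (uint64_t)delta;
}
```

With quotas enabled in generation 10, removing F (written in generation 5)
leaves the usage untouched, while writing G in generation 12 is charged in
full.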

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Re: [PATCH v5 17/18] btrfs: track data relocation with simple quota
  2023-07-27 22:13 ` [PATCH v5 17/18] btrfs: track data relocation " Boris Burkov
@ 2023-08-21 18:16   ` Josef Bacik
  0 siblings, 0 replies; 53+ messages in thread
From: Josef Bacik @ 2023-08-21 18:16 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:13:04PM -0700, Boris Burkov wrote:
> Relocation data allocations are quite tricky for simple quotas. The
> basic data relocation sequence is (ignoring details that aren't relevant
> to this fix):
> - create a fake relocation data fs root
> - create a fake relocation inode in that root
> - foreach data extent:
>   - preallocate a data extent on behalf of the fake inode
>   - copy over the data
> - foreach extent
>   - swap the refs so that the original file extent now refers to the new
>     extent item
> - drop the fake root, dropping its refs on the old extents, which lets
>   us delete them.
> 
> Done naively, this results in storing an extent item in the extent tree
> whose owner_ref points at the relocation data root and a no-op squota
> recording, since the reloc root is not a legit fstree. So far, that's
> OK. The problem comes when you do the swap, and leave an extent item
> owned by this bogus root as the real permanent extents of the file. If
> the file then drops that ref, we free it and no-op account that against
> the fake relocation root. Essentially, this means that relocation is
> simple quota "extent laundering", since we re-own the extents into a
> fake root.
> 
> Simple quotas very intentionally doesn't have a mechanism for
> transferring ownership of extents, as that is exactly the complicated
> thing we are trying to avoid with the new design. Further, it cannot be
> correctly done in this case, since at the time you create the new
> "real" refs, there is no way to know which was the original owner before
> relocation unless we track it.
> 
> Therefore, it makes more sense to trick the preallocation to handle
> relocation as a special case and note the proper owner ref from the
> beginning. That way, we never write out an extent item without the
> correct owner ref that it will eventually have.
> 
> This could be done by wiring a special root parameter all the way
> through the allocation code path, but to avoid that special case
> touching all the code, take advantage of the serial nature of relocation
> to store the src root on the relocation root object. Then when we finish
> the prealloc, if it happens to be this case, prepare the delayed ref
> appropriately.
> 
> We must also add logic to handle relocating adjacent extents with
> different owning roots. Those cannot be preallocated together in a
> cluster as it would lose the separate ownership information.
> 
> This is obviously a smelly bit of code, but I think it is the best
> solution to the problem, given the relocation implementation.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Re: [PATCH v5 18/18] btrfs: only set QUOTA_ENABLED when done reading qgroups
  2023-07-27 22:13 ` [PATCH v5 18/18] btrfs: only set QUOTA_ENABLED when done reading qgroups Boris Burkov
@ 2023-08-21 18:16   ` Josef Bacik
  0 siblings, 0 replies; 53+ messages in thread
From: Josef Bacik @ 2023-08-21 18:16 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:13:05PM -0700, Boris Burkov wrote:
> In open_ctree, we set BTRFS_FS_QUOTA_ENABLED as soon as we see a
> quota_root, as opposed to after we are done setting up the qgroup
> structures. In the quota_enable path, we wait until after the structures
> are set up. Likewise, in disable, we clear the bit before tearing down
> the structures. I feel that this organization is less surprising for the
> open_ctree path.
> 
> I don't believe this fixes any actual bug, but avoids potential
> confusion when using btrfs_qgroup_mode in an intermediate state where we
> are enabled but haven't yet setup the qgroup status flags. It also
> avoids any risk of calling a qgroup function and attempting to use the
> qgroup rbtrees before they exist/are setup.
> 
> This all occurs before we do rw setup, so I believe it should be mostly
> a no-op.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Re: [PATCH v5 00/18] btrfs: simple quotas
  2023-07-27 22:12 [PATCH v5 00/18] btrfs: simple quotas Boris Burkov
                   ` (17 preceding siblings ...)
  2023-07-27 22:13 ` [PATCH v5 18/18] btrfs: only set QUOTA_ENABLED when done reading qgroups Boris Burkov
@ 2023-09-07 10:51 ` David Sterba
  2023-09-07 20:51   ` Boris Burkov
  2023-09-11 18:12   ` David Sterba
  18 siblings, 2 replies; 53+ messages in thread
From: David Sterba @ 2023-09-07 10:51 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:12:47PM -0700, Boris Burkov wrote:
> btrfs quota groups (qgroups) are a compelling feature of btrfs that
> allow flexible control for limiting subvolume data and metadata usage.
> However, due to btrfs's high level decision to tradeoff snapshot
> performance against ref-counting performance, qgroups suffer from
> non-trivial performance issues that make them unattractive in certain
> workloads. Particularly, frequent backref walking during writes and
> during commits can make operations increasingly expensive as the number
> of snapshots scales up. For that reason, we have never been able to
> commit to using qgroups in production at Meta, despite significant
> interest from people running container workloads, where we would benefit
> from protecting the rest of the host from a buggy application in a
> container running away with disk usage. This patch series introduces a
> simplified version of qgroups called
> simple quotas (squotas) which never computes global reference counts
> for extents, and thus has similar performance characteristics to normal,
> quotas disabled, btrfs. The "trick" is that in simple quotas mode, we
> account all extents permanently to the subvolume in which they were
> originally created. That allows us to make all accounting 1:1 with
> extent item lifetime, removing the need to walk backrefs. However,
> this sacrifices the ability to compute shared vs. exclusive usage. It
> also results in counter-intuitive, though still predictable and simple
> accounting in the cases where an original extent is removed while a
> shared copy still exists. Qgroups is able to detect that case and count
> the remaining copy as an exclusive owner, while squotas is not. As a
> result, squotas works best when the original extent is immutable and
> outlives any clones.
> 
> ==Format Change==
> In order to track the original creating subvolume of a data extent in
> the face of reflinks, it is necessary to add additional accounting to
> the extent item. To save space, this is done with a new inline ref item.
> However, the downside of this approach is that it makes enabling squota
> an incompat change, denoted by the new incompat bit SIMPLE_QUOTA. When
> this bit is set and quotas are enabled, new extent items get the extra
> accounting, and freed extent items check for the accounting to find
> their creating subvolume. In addition, 1:1 with this incompat bit,
> the quota status item now tracks a "quota enablement generation" needed
> for properly handling deleting extents which predate enablement.
> 
> ==API==
> Squotas reuses the api of qgroups.

So apart from the accounting, the hierarchy of qgroups can still be
built as before, right? In the example you create a group 1/100 so I
assume that it's still qgroups from the outside, and that the limits can
be set.

Because if not, then squotas would make more sense as a separate
infrastructure, under quotas. That is, quotas would be the abstraction
while qgroups or squotas would be the implementation.

> The only difference is that when you
> enable quotas via `btrfs quota enable`, you pass the `--simple` flag.
> Squotas will always report exclusive == shared for each qgroup. Squotas
> deal with extent_item/metadata_item sizes and thus do not do anything
> special with compression. Squotas also introduce auto inheritance for
> nested subvols. The API is documented more fully in the documentation
> patches in btrfs-progs.

The lack of a distinction between exclusive and shared sizes will be
confusing I guess, so we need to make it clear in the documentation and
in the UI that it's either full or simple mode.

I've added the patchset to for-next, we may need an iteration or two to
fix some issues I've seen so far but on the fundamental level I think
it's ok.


* Re: [PATCH v5 02/18] btrfs: add new quota mode for simple quotas
  2023-07-27 22:12 ` [PATCH v5 02/18] btrfs: add new quota mode for simple quotas Boris Burkov
  2023-08-21 18:00   ` Josef Bacik
@ 2023-09-07 11:19   ` David Sterba
  1 sibling, 0 replies; 53+ messages in thread
From: David Sterba @ 2023-09-07 11:19 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:12:49PM -0700, Boris Burkov wrote:
> Add a new quota mode called "simple quotas". It can be enabled by the
> existing quota enable ioctl via a new command, and sets an incompat
> bit, as the implementation of simple quotas will make backwards
> incompatible changes to the disk format of the extent tree.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/delayed-ref.c          |  4 +-
>  fs/btrfs/fs.h                   |  5 +-
>  fs/btrfs/ioctl.c                |  3 +-
>  fs/btrfs/qgroup.c               | 91 +++++++++++++++++++++++----------
>  fs/btrfs/qgroup.h               |  4 +-
>  fs/btrfs/root-tree.c            |  2 +-
>  fs/btrfs/transaction.c          |  4 +-
>  include/uapi/linux/btrfs.h      |  2 +
>  include/uapi/linux/btrfs_tree.h | 14 ++++-
>  9 files changed, 91 insertions(+), 38 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 6a13cf00218b..a9b938d3a531 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -898,7 +898,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
>  		return -ENOMEM;
>  	}
>  
> -	if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_DISABLED &&

The expression

"btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_DISABLED"

is repeated in many places, so this should be a helper. Checking for a
specific mode open coded is fine.
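
For illustration, a minimal userspace sketch of such a helper. The stub
struct, the shortened enum names and the name qgroup_enabled() are
assumptions made up for this example, not existing kernel API.

```c
#include <stdbool.h>

/* Shortened stand-ins for the modes used in the patch. */
enum qgroup_mode {
	QGROUP_MODE_DISABLED,
	QGROUP_MODE_FULL,
	QGROUP_MODE_SIMPLE,
};

/* Hypothetical stub carrying only the state the mode lookup needs. */
struct fs_info_stub {
	bool quota_enabled;	/* stands in for the QUOTA_ENABLED bit */
	bool simple;		/* stands in for the SIMPLE status flag */
};

static enum qgroup_mode qgroup_mode(const struct fs_info_stub *fs_info)
{
	if (!fs_info->quota_enabled)
		return QGROUP_MODE_DISABLED;
	if (fs_info->simple)
		return QGROUP_MODE_SIMPLE;
	return QGROUP_MODE_FULL;
}

/* One place for the repeated "mode != DISABLED" test. */
static bool qgroup_enabled(const struct fs_info_stub *fs_info)
{
	return qgroup_mode(fs_info) != QGROUP_MODE_DISABLED;
}
```

Callers would then read "if (qgroup_enabled(fs_info) && ...)" instead of
repeating the comparison.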

>  	    !generic_ref->skip_qgroup) {
>  		record = kzalloc(sizeof(*record), GFP_NOFS);
>  		if (!record) {
> @@ -1002,7 +1002,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
>  		return -ENOMEM;
>  	}
>  
> -	if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_DISABLED &&
>  	    !generic_ref->skip_qgroup) {
>  		record = kzalloc(sizeof(*record), GFP_NOFS);
>  		if (!record) {
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index 203d2a267828..f76f450c2abf 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -218,7 +218,8 @@ enum {
>  	 BTRFS_FEATURE_INCOMPAT_NO_HOLES	|	\
>  	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID	|	\
>  	 BTRFS_FEATURE_INCOMPAT_RAID1C34	|	\
> -	 BTRFS_FEATURE_INCOMPAT_ZONED)
> +	 BTRFS_FEATURE_INCOMPAT_ZONED		|	\
> +	 BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA)
>  
>  #ifdef CONFIG_BTRFS_DEBUG
>  	/*
> @@ -233,7 +234,6 @@ enum {
>  
>  #define BTRFS_FEATURE_INCOMPAT_SUPP		\
>  	(BTRFS_FEATURE_INCOMPAT_SUPP_STABLE)
> -

Keep the lines there please

>  #endif
>  
>  #define BTRFS_FEATURE_INCOMPAT_SAFE_SET			\
> @@ -790,7 +790,6 @@ struct btrfs_fs_info {
>  	struct lockdep_map btrfs_state_change_map[4];
>  	struct lockdep_map btrfs_trans_pending_ordered_map;
>  	struct lockdep_map btrfs_ordered_extent_map;
> -

Same

>  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>  	spinlock_t ref_verify_lock;
>  	struct rb_root block_tree;
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index a895d105464b..9b61bc62e439 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -3691,7 +3691,8 @@ static long btrfs_ioctl_quota_ctl(struct file *file, void __user *arg)
>  
>  	switch (sa->cmd) {
>  	case BTRFS_QUOTA_CTL_ENABLE:
> -		ret = btrfs_quota_enable(fs_info);
> +	case BTRFS_QUOTA_CTL_ENABLE_SIMPLE_QUOTA:
> +		ret = btrfs_quota_enable(fs_info, sa);
>  		break;
>  	case BTRFS_QUOTA_CTL_DISABLE:
>  		ret = btrfs_quota_disable(fs_info);
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 0a2085ae9bcd..558f66994667 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -34,6 +34,8 @@ enum btrfs_qgroup_mode btrfs_qgroup_mode(struct btrfs_fs_info *fs_info)
>  {
>  	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
>  		return BTRFS_QGROUP_MODE_DISABLED;
> +	if (fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_SIMPLE)
> +		return BTRFS_QGROUP_MODE_SIMPLE;
>  	return BTRFS_QGROUP_MODE_FULL;
>  }
>  
> @@ -347,6 +349,8 @@ int btrfs_verify_qgroup_counts(struct btrfs_fs_info *fs_info, u64 qgroupid,
>  
>  static void qgroup_mark_inconsistent(struct btrfs_fs_info *fs_info)
>  {
> +	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
> +		return;
>  	fs_info->qgroup_flags |= (BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT |
>  				  BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCAN |
>  				  BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING);
> @@ -367,8 +371,9 @@ int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info)
>  	int ret = 0;
>  	u64 flags = 0;
>  	u64 rescan_progress = 0;
> +	bool simple;
>  
> -	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
> +	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_DISABLED)
>  		return 0;
>  
>  	fs_info->qgroup_ulist = ulist_alloc(GFP_KERNEL);
> @@ -418,14 +423,14 @@ int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info)
>  				 "old qgroup version, quota disabled");
>  				goto out;
>  			}
> +			fs_info->qgroup_flags = btrfs_qgroup_status_flags(l, ptr);
> +			simple = fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_SIMPLE;

Boolean expressions assigned to a variable should be written like

			x = (a & b);

so it's clear that it's not a trivial statement. Similarly for '==' or
the ternary operator.
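
A small self-contained sketch of that style, for illustration only. The
wrapper functions and shortened macro names are hypothetical; the two
values are copied from the quoted patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define STATUS_FLAG_SIMPLE		(1ULL << 5)	/* value from the patch */
#define QUOTA_CTL_ENABLE_SIMPLE_QUOTA	4		/* value from the patch */

/* Mask test: the parentheses mark the right-hand side as a boolean
 * expression; conversion to bool maps any non-zero value to true. */
static bool flags_are_simple(uint64_t flags)
{
	const bool simple = (flags & STATUS_FLAG_SIMPLE);

	return simple;
}

/* Same convention for an equality test. */
static bool cmd_is_simple(uint64_t cmd)
{
	const bool simple = (cmd == QUOTA_CTL_ENABLE_SIMPLE_QUOTA);

	return simple;
}
```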

>  			if (btrfs_qgroup_status_generation(l, ptr) !=
> -			    fs_info->generation) {
> +			    fs_info->generation && !simple) {
>  				qgroup_mark_inconsistent(fs_info);
>  				btrfs_err(fs_info,
>  					"qgroup generation mismatch, marked as inconsistent");
>  			}
> -			fs_info->qgroup_flags = btrfs_qgroup_status_flags(l,
> -									  ptr);
>  			rescan_progress = btrfs_qgroup_status_rescan(l, ptr);
>  			goto next1;
>  		}
> @@ -557,7 +562,7 @@ bool btrfs_check_quota_leak(struct btrfs_fs_info *fs_info)
>  	struct rb_node *node;
>  	bool ret = false;
>  
> -	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
> +	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_DISABLED)
>  		return ret;
>  	/*
>  	 * Since we're unmounting, there is no race and no need to grab qgroup
> @@ -956,7 +961,8 @@ static int btrfs_clean_quota_tree(struct btrfs_trans_handle *trans,
>  	return ret;
>  }
>  
> -int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
> +int btrfs_quota_enable(struct btrfs_fs_info *fs_info,
> +		       struct btrfs_ioctl_quota_ctl_args *quota_ctl_args)
>  {
>  	struct btrfs_root *quota_root;
>  	struct btrfs_root *tree_root = fs_info->tree_root;
> @@ -968,6 +974,7 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
>  	struct btrfs_qgroup *qgroup = NULL;
>  	struct btrfs_trans_handle *trans = NULL;
>  	struct ulist *ulist = NULL;
> +	bool simple = quota_ctl_args->cmd == BTRFS_QUOTA_CTL_ENABLE_SIMPLE_QUOTA;

	const bool simple = ( ... == ...);

>  	int ret = 0;
>  	int slot;
>  
> @@ -1070,8 +1077,11 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
>  				 struct btrfs_qgroup_status_item);
>  	btrfs_set_qgroup_status_generation(leaf, ptr, trans->transid);
>  	btrfs_set_qgroup_status_version(leaf, ptr, BTRFS_QGROUP_STATUS_VERSION);
> -	fs_info->qgroup_flags = BTRFS_QGROUP_STATUS_FLAG_ON |
> -				BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
> +	fs_info->qgroup_flags = BTRFS_QGROUP_STATUS_FLAG_ON;
> +	if (simple)
> +		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_SIMPLE;
> +	else
> +		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
>  	btrfs_set_qgroup_status_flags(leaf, ptr, fs_info->qgroup_flags &
>  				      BTRFS_QGROUP_STATUS_FLAGS_MASK);
>  	btrfs_set_qgroup_status_rescan(leaf, ptr, 0);
> @@ -1187,8 +1197,14 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
>  	spin_lock(&fs_info->qgroup_lock);
>  	fs_info->quota_root = quota_root;
>  	set_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);
> +	if (simple)
> +		btrfs_set_fs_incompat(fs_info, SIMPLE_QUOTA);
>  	spin_unlock(&fs_info->qgroup_lock);
>  
> +	/* Skip rescan for simple qgroups */
> +	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
> +		goto out_free_path;
> +
>  	ret = qgroup_rescan_init(fs_info, 0, 1);
>  	if (!ret) {
>  	        qgroup_rescan_zero_tracking(fs_info);
> @@ -1302,6 +1318,7 @@ int btrfs_quota_disable(struct btrfs_fs_info *fs_info)
>  	quota_root = fs_info->quota_root;
>  	fs_info->quota_root = NULL;
>  	fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_ON;
> +	fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_SIMPLE;
>  	fs_info->qgroup_drop_subtree_thres = BTRFS_MAX_LEVEL;
>  	spin_unlock(&fs_info->qgroup_lock);
>  
> @@ -1787,6 +1804,9 @@ int btrfs_qgroup_trace_extent_nolock(struct btrfs_fs_info *fs_info,
>  	struct btrfs_qgroup_extent_record *entry;
>  	u64 bytenr = record->bytenr;
>  
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
> +		return 0;
> +
>  	lockdep_assert_held(&delayed_refs->lock);
>  	trace_btrfs_qgroup_trace_extent(fs_info, record);
>  
> @@ -1819,6 +1839,8 @@ int btrfs_qgroup_trace_extent_post(struct btrfs_trans_handle *trans,
>  	struct btrfs_backref_walk_ctx ctx = { 0 };
>  	int ret;
>  
> +	if (btrfs_qgroup_mode(trans->fs_info) != BTRFS_QGROUP_MODE_FULL)
> +		return 0;
>  	/*
>  	 * We are always called in a context where we are already holding a
>  	 * transaction handle. Often we are called when adding a data delayed
> @@ -1874,7 +1896,7 @@ int btrfs_qgroup_trace_extent(struct btrfs_trans_handle *trans, u64 bytenr,
>  	struct btrfs_delayed_ref_root *delayed_refs;
>  	int ret;
>  
> -	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL

Could this be written as an '==' condition?

>  	    || bytenr == 0 || num_bytes == 0)

Together with the rest of the condition, it's not clear what exactly it
is testing for.

>  		return 0;
>  	record = kzalloc(sizeof(*record), GFP_NOFS);
> @@ -1907,7 +1929,7 @@ int btrfs_qgroup_trace_leaf_items(struct btrfs_trans_handle *trans,
>  	u64 bytenr, num_bytes;
>  
>  	/* We can be called directly from walk_up_proc() */
> -	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)

Reading more of the patch, "!= FULL" means either disabled or simple
quotas. Can this be inverted so we have something like "if no
accounting"? That would cover both quotas disabled and simple quotas.
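
A rough userspace sketch of what such a predicate could look like. The
name qgroup_full_accounting(), the shortened enum and the return
convention of trace_extent() are assumptions for this example only.

```c
#include <stdbool.h>
#include <stdint.h>

enum qgroup_mode {
	QGROUP_MODE_DISABLED,
	QGROUP_MODE_FULL,
	QGROUP_MODE_SIMPLE,
};

/* True only in full mode; false for both disabled and simple quotas. */
static bool qgroup_full_accounting(enum qgroup_mode mode)
{
	return mode == QGROUP_MODE_FULL;
}

/* Illustrative caller: returns 1 if the extent would be traced and 0 if
 * it is skipped. The mode test is kept apart from the argument checks
 * so each condition reads on its own. */
static int trace_extent(enum qgroup_mode mode, uint64_t bytenr,
			uint64_t num_bytes)
{
	if (!qgroup_full_accounting(mode))
		return 0;
	if (bytenr == 0 || num_bytes == 0)
		return 0;
	return 1;
}
```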

>  		return 0;
>  
>  	for (i = 0; i < nr; i++) {
> @@ -2283,7 +2305,7 @@ static int qgroup_trace_subtree_swap(struct btrfs_trans_handle *trans,
>  	int level;
>  	int ret;
>  
> -	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
>  		return 0;
>  
>  	/* Wrong parameter order */
> @@ -2340,7 +2362,7 @@ int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans,
>  	BUG_ON(root_level < 0 || root_level >= BTRFS_MAX_LEVEL);
>  	BUG_ON(root_eb == NULL);
>  
> -	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
>  		return 0;
>  
>  	spin_lock(&fs_info->qgroup_lock);
> @@ -2680,7 +2702,7 @@ int btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans, u64 bytenr,
>  	 * If quotas get disabled meanwhile, the resources need to be freed and
>  	 * we can't just exit here.
>  	 */
> -	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) ||
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL ||
>  	    fs_info->qgroup_flags & BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING)
>  		goto out_free;
>  
> @@ -2768,6 +2790,9 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans)
>  	u64 qgroup_to_skip;
>  	int ret = 0;
>  
> +	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
> +		return 0;
> +
>  	delayed_refs = &trans->transaction->delayed_refs;
>  	qgroup_to_skip = delayed_refs->qgroup_to_skip;
>  	while ((node = rb_first(&delayed_refs->dirty_extent_root))) {
> @@ -2883,7 +2908,7 @@ int btrfs_run_qgroups(struct btrfs_trans_handle *trans)
>  			qgroup_mark_inconsistent(fs_info);
>  		spin_lock(&fs_info->qgroup_lock);
>  	}
> -	if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_DISABLED)
>  		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_ON;
>  	else
>  		fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_ON;
> @@ -2936,7 +2961,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
>  
>  	if (!committing)
>  		mutex_lock(&fs_info->qgroup_ioctl_lock);
> -	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
> +	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_DISABLED)
>  		goto out;
>  
>  	quota_root = fs_info->quota_root;
> @@ -3010,7 +3035,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
>  		qgroup_dirty(fs_info, dstgroup);
>  	}
>  
> -	if (srcid) {
> +	if (srcid && btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_FULL) {
>  		srcgroup = find_qgroup_rb(fs_info, srcid);
>  		if (!srcgroup)
>  			goto unlock;
> @@ -3302,6 +3327,9 @@ static int qgroup_rescan_leaf(struct btrfs_trans_handle *trans,
>  	int slot;
>  	int ret;
>  
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
> +		return 1;
> +
>  	mutex_lock(&fs_info->qgroup_rescan_lock);
>  	extent_root = btrfs_extent_root(fs_info,
>  				fs_info->qgroup_rescan_progress.objectid);
> @@ -3384,8 +3412,8 @@ static bool rescan_should_stop(struct btrfs_fs_info *fs_info)
>  {
>  	return btrfs_fs_closing(fs_info) ||
>  		test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state) ||
> -		!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) ||
> -			  fs_info->qgroup_flags & BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCAN;
> +		btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_DISABLED ||
> +		fs_info->qgroup_flags & BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCAN;

The condition has become quite unreadable, please convert it to an 'if'
or a series of 'ifs' grouping the conditions. E.g. the
btrfs_fs_closing() check can be a separate statement as it's not
directly related to the quotas and rescan.
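
Something along these lines, shown as a userspace sketch. The struct
rescan_state stub and its field names are made up for illustration, and
the flag value is a placeholder.

```c
#include <stdbool.h>
#include <stdint.h>

#define RUNTIME_FLAG_CANCEL_RESCAN	(1ULL << 3)	/* placeholder value */

/* Hypothetical stub holding the four inputs of the original condition. */
struct rescan_state {
	bool fs_closing;	/* stands in for btrfs_fs_closing() */
	bool remounting;	/* stands in for the REMOUNTING state bit */
	bool quota_enabled;	/* mode != DISABLED */
	uint64_t qgroup_flags;
};

static bool rescan_should_stop(const struct rescan_state *s)
{
	/* Filesystem-wide state, not specific to quotas. */
	if (s->fs_closing)
		return true;
	if (s->remounting)
		return true;

	/* Quota and rescan specific state. */
	if (!s->quota_enabled)
		return true;
	if (s->qgroup_flags & RUNTIME_FLAG_CANCEL_RESCAN)
		return true;

	return false;
}
```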

>  }
>  
>  static void btrfs_qgroup_rescan_worker(struct btrfs_work *work)
> @@ -3399,6 +3427,9 @@ static void btrfs_qgroup_rescan_worker(struct btrfs_work *work)
>  	bool stopped = false;
>  	bool did_leaf_rescans = false;
>  
> +	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
> +		return;
> +
>  	path = btrfs_alloc_path();
>  	if (!path)
>  		goto out;
> @@ -3502,6 +3533,12 @@ qgroup_rescan_init(struct btrfs_fs_info *fs_info, u64 progress_objectid,
>  {
>  	int ret = 0;
>  
> +	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE) {
> +		btrfs_warn(fs_info, "qgroup rescan init failed, running in simple mode. mode: %d\n",

No "\n" in the message helpers, and I don't think we need the numeric
value of the mode when it's stated in words.

> +			btrfs_qgroup_mode(fs_info));
> +		return -EINVAL;
> +	}
> +
>  	if (!init_flags) {
>  		/* we're resuming qgroup rescan at mount time */
>  		if (!(fs_info->qgroup_flags &
> @@ -3532,7 +3569,7 @@ qgroup_rescan_init(struct btrfs_fs_info *fs_info, u64 progress_objectid,
>  			btrfs_warn(fs_info,
>  			"qgroup rescan init failed, qgroup is not enabled");
>  			ret = -EINVAL;
> -		} else if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)) {
> +		} else if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_DISABLED) {
>  			/* Quota disable is in progress */
>  			ret = -EBUSY;
>  		}
> @@ -3788,7 +3825,7 @@ static int qgroup_reserve_data(struct btrfs_inode *inode,
>  	u64 to_reserve;
>  	int ret;
>  
> -	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &root->fs_info->flags) ||
> +	if (btrfs_qgroup_mode(root->fs_info) == BTRFS_QGROUP_MODE_DISABLED ||
>  	    !is_fstree(root->root_key.objectid) || len == 0)
>  		return 0;
>  
> @@ -3920,7 +3957,7 @@ static int __btrfs_qgroup_release_data(struct btrfs_inode *inode,
>  	int trace_op = QGROUP_RELEASE;
>  	int ret;
>  
> -	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &inode->root->fs_info->flags))
> +	if (btrfs_qgroup_mode(inode->root->fs_info) == BTRFS_QGROUP_MODE_DISABLED)
>  		return 0;
>  
>  	/* In release case, we shouldn't have @reserved */
> @@ -4031,7 +4068,7 @@ int btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
>  	struct btrfs_fs_info *fs_info = root->fs_info;
>  	int ret;
>  
> -	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) ||
> +	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_DISABLED ||
>  	    !is_fstree(root->root_key.objectid) || num_bytes == 0)
>  		return 0;
>  
> @@ -4072,7 +4109,7 @@ void btrfs_qgroup_free_meta_all_pertrans(struct btrfs_root *root)
>  {
>  	struct btrfs_fs_info *fs_info = root->fs_info;
>  
> -	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) ||
> +	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_DISABLED ||
>  	    !is_fstree(root->root_key.objectid))
>  		return;
>  
> @@ -4088,7 +4125,7 @@ void __btrfs_qgroup_free_meta(struct btrfs_root *root, int num_bytes,
>  {
>  	struct btrfs_fs_info *fs_info = root->fs_info;
>  
> -	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) ||
> +	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_DISABLED ||
>  	    !is_fstree(root->root_key.objectid))
>  		return;
>  
> @@ -4153,7 +4190,7 @@ void btrfs_qgroup_convert_reserved_meta(struct btrfs_root *root, int num_bytes)
>  {
>  	struct btrfs_fs_info *fs_info = root->fs_info;
>  
> -	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) ||
> +	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_DISABLED ||
>  	    !is_fstree(root->root_key.objectid))
>  		return;
>  	/* Same as btrfs_qgroup_free_meta_prealloc() */
> @@ -4261,7 +4298,7 @@ int btrfs_qgroup_add_swapped_blocks(struct btrfs_trans_handle *trans,
>  	int level = btrfs_header_level(subvol_parent) - 1;
>  	int ret = 0;
>  
> -	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
>  		return 0;
>  
>  	if (btrfs_node_ptr_generation(subvol_parent, subvol_slot) >
> @@ -4371,7 +4408,7 @@ int btrfs_qgroup_trace_subtree_after_cow(struct btrfs_trans_handle *trans,
>  	int ret = 0;
>  	int i;
>  
> -	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
>  		return 0;
>  	if (!is_fstree(root->root_key.objectid) || !root->reloc_root)
>  		return 0;
> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
> index bb15e55f00b8..d4c4d039585f 100644
> --- a/fs/btrfs/qgroup.h
> +++ b/fs/btrfs/qgroup.h
> @@ -249,13 +249,15 @@ enum {
>  	ENUM_BIT(QGROUP_FREE),
>  };
>  
> -int btrfs_quota_enable(struct btrfs_fs_info *fs_info);
>  enum btrfs_qgroup_mode {
>  	BTRFS_QGROUP_MODE_DISABLED,
>  	BTRFS_QGROUP_MODE_FULL,
> +	BTRFS_QGROUP_MODE_SIMPLE
>  };
>  
>  enum btrfs_qgroup_mode btrfs_qgroup_mode(struct btrfs_fs_info *fs_info);
> +int btrfs_quota_enable(struct btrfs_fs_info *fs_info,
> +		       struct btrfs_ioctl_quota_ctl_args *quota_ctl_args);
>  int btrfs_quota_disable(struct btrfs_fs_info *fs_info);
>  int btrfs_qgroup_rescan(struct btrfs_fs_info *fs_info);
>  void btrfs_qgroup_rescan_resume(struct btrfs_fs_info *fs_info);
> diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
> index 859874579456..044a8c2710f8 100644
> --- a/fs/btrfs/root-tree.c
> +++ b/fs/btrfs/root-tree.c
> @@ -508,7 +508,7 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
>  	struct btrfs_fs_info *fs_info = root->fs_info;
>  	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
>  
> -	if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)) {
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_DISABLED) {
>  		/* One for parent inode, two for dir entries */
>  		qgroup_num_bytes = 3 * fs_info->nodesize;
>  		ret = btrfs_qgroup_reserve_meta_prealloc(root,
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 815f61d6b506..89ff15aa085f 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1529,11 +1529,11 @@ static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
>  	int ret;
>  
>  	/*
> -	 * Save some performance in the case that qgroups are not
> +	 * Save some performance in the case that full qgroups are not
>  	 * enabled. If this check races with the ioctl, rescan will
>  	 * kick in anyway.
>  	 */
> -	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
>  		return 0;
>  
>  	/*
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index dbb8b96da50d..0e42f4a2121d 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -333,6 +333,7 @@ struct btrfs_ioctl_fs_info_args {
>  #define BTRFS_FEATURE_INCOMPAT_RAID1C34		(1ULL << 11)
>  #define BTRFS_FEATURE_INCOMPAT_ZONED		(1ULL << 12)
>  #define BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2	(1ULL << 13)
> +#define BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA	(1ULL << 14)
>  
>  struct btrfs_ioctl_feature_flags {
>  	__u64 compat_flags;
> @@ -753,6 +754,7 @@ struct btrfs_ioctl_get_dev_stats {
>  #define BTRFS_QUOTA_CTL_ENABLE	1
>  #define BTRFS_QUOTA_CTL_DISABLE	2
>  #define BTRFS_QUOTA_CTL_RESCAN__NOTUSED	3
> +#define BTRFS_QUOTA_CTL_ENABLE_SIMPLE_QUOTA 4
>  struct btrfs_ioctl_quota_ctl_args {
>  	__u64 cmd;
>  	__u64 status;
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index ab38d0f411fa..47aca414a41b 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -1200,9 +1200,21 @@ static inline __u16 btrfs_qgroup_level(__u64 qgroupid)
>   */
>  #define BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT	(1ULL << 2)
>  
> +/*
> + * 3rd and 4th bits taken by non-persisted status flags in qgroup.h
> + */

If the bits are not persisted and used only internally or for in-memory
tracking then they should be renumbered so you can use the value (1ULL
<< 3) in sequence.
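
A minimal userspace sketch of that numbering, assuming the in-memory
flags can be moved to otherwise unused high bits. All EX_* names and
the choice of bits 32 and 33 are hypothetical, not taken from the
patch:

```c
#include <stdint.h>

/* Persisted on-disk status flags, numbered contiguously (hypothetical names). */
#define EX_STATUS_FLAG_ON		(1ULL << 0)
#define EX_STATUS_FLAG_RESCAN		(1ULL << 1)
#define EX_STATUS_FLAG_INCONSISTENT	(1ULL << 2)
#define EX_STATUS_FLAG_SIMPLE		(1ULL << 3)

#define EX_STATUS_FLAGS_MASK	(EX_STATUS_FLAG_ON |		\
				 EX_STATUS_FLAG_RESCAN |	\
				 EX_STATUS_FLAG_INCONSISTENT |	\
				 EX_STATUS_FLAG_SIMPLE)

/* In-memory only flags, moved out of the persisted range (assumed bits). */
#define EX_RUNTIME_FLAG_A	(1ULL << 32)
#define EX_RUNTIME_FLAG_B	(1ULL << 33)

/* Strip everything that must never reach the on-disk status item. */
static uint64_t ex_flags_to_disk(uint64_t flags)
{
	return flags & EX_STATUS_FLAGS_MASK;
}
```

With this layout the persisted mask is a contiguous low range and the
runtime flags are masked off before anything is written out.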

> +
> +/*
> + * Whether or not this filesystem is using simple quotas.
> + * Not exactly the incompat bit, because we support using simple quotas,
> + * disabling it, then going back to full qgroup quotas.
> + */
> +#define BTRFS_QGROUP_STATUS_FLAG_SIMPLE	(1ULL << 5)

Here the context of 'SIMPLE' is not obvious, it's referring to
accounting but adding that to the identifier would make it quite long so
I'm not sure if we should do that. OTOH it's in the ioctl public API so
clarity is important.

> +
>  #define BTRFS_QGROUP_STATUS_FLAGS_MASK	(BTRFS_QGROUP_STATUS_FLAG_ON |		\
>  					 BTRFS_QGROUP_STATUS_FLAG_RESCAN |	\
> -					 BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT)
> +					 BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT |	\
> +					 BTRFS_QGROUP_STATUS_FLAG_SIMPLE)
>  
>  #define BTRFS_QGROUP_STATUS_VERSION        1
>  
> -- 
> 2.41.0

* Re: [PATCH v5 03/18] btrfs: expose quota mode via sysfs
  2023-07-27 22:12 ` [PATCH v5 03/18] btrfs: expose quota mode via sysfs Boris Burkov
  2023-08-21 18:00   ` Josef Bacik
@ 2023-09-07 11:25   ` David Sterba
  1 sibling, 0 replies; 53+ messages in thread
From: David Sterba @ 2023-09-07 11:25 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:12:50PM -0700, Boris Burkov wrote:
> Add a new sysfs file
> /sys/fs/btrfs/<uuid>/qgroups/mode
> which prints out the mode qgroups is running in. The possible modes are
> disabled, qgroup, and squota

Can you get the 'disabled' at all? Because when the quotas are disabled
by ioctl the whole sysfs directory is gone, see btrfs_free_qgroup_config().

> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/sysfs.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index b1d1ac25237b..e53614753391 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -2086,6 +2086,31 @@ static ssize_t qgroup_enabled_show(struct kobject *qgroups_kobj,
>  }
>  BTRFS_ATTR(qgroups, enabled, qgroup_enabled_show);
>  
> +static ssize_t qgroup_mode_show(struct kobject *qgroups_kobj,
> +				struct kobj_attribute *a,
> +				char *buf)
> +{
> +	struct btrfs_fs_info *fs_info = to_fs_info(qgroups_kobj->parent);
> +	char *mode = "";
> +
> +	spin_lock(&fs_info->qgroup_lock);
> +	switch (btrfs_qgroup_mode(fs_info)) {
> +	case BTRFS_QGROUP_MODE_DISABLED:
> +		mode = "disabled";
> +		break;
> +	case BTRFS_QGROUP_MODE_FULL:
> +		mode = "qgroup";
> +		break;
> +	case BTRFS_QGROUP_MODE_SIMPLE:
> +		mode = "squota";

You can do

	lock;
	switch (mode) {
	case FULL:   sysfs_emit(buf, "qgroup\n"); break;
	case SIMPLE: sysfs_emit(buf, "simple\n"); break;
	}
	unlock;
	return 7;

or track the return value from sysfs_emit so it's not so hacky.
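
A rough userspace sketch of the second variant, tracking the return
value instead of hardcoding 7. Here ex_emit() is only a hypothetical
stand-in for sysfs_emit() built on vsnprintf(), the enum and function
names are made up, the mode strings are the ones from the patch, and
the locking is left out:

```c
#include <stdarg.h>
#include <stdio.h>

#define EX_PAGE_SIZE 4096

/* Hypothetical stand-in for sysfs_emit(): returns the number of bytes written. */
static int ex_emit(char *buf, const char *fmt, ...)
{
	va_list args;
	int len;

	va_start(args, fmt);
	len = vsnprintf(buf, EX_PAGE_SIZE, fmt, args);
	va_end(args);

	if (len < 0)
		return 0;
	if (len >= EX_PAGE_SIZE)
		len = EX_PAGE_SIZE - 1;
	return len;
}

enum ex_mode { EX_MODE_FULL, EX_MODE_SIMPLE };

/* Emit the mode string and hand back whatever the emit helper reported. */
static int ex_mode_show(enum ex_mode mode, char *buf)
{
	int ret = 0;

	switch (mode) {
	case EX_MODE_FULL:
		ret = ex_emit(buf, "qgroup\n");
		break;
	case EX_MODE_SIMPLE:
		ret = ex_emit(buf, "squota\n");
		break;
	}
	return ret;
}
```

This way the returned length stays correct even if one of the strings
changes later.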

> +		break;
> +	}
> +	spin_unlock(&fs_info->qgroup_lock);
> +
> +	return sysfs_emit(buf, "%s\n", mode);
> +}
> +BTRFS_ATTR(qgroups, mode, qgroup_mode_show);
> +
>  static ssize_t qgroup_inconsistent_show(struct kobject *qgroups_kobj,
>  					struct kobj_attribute *a,
>  					char *buf)
> @@ -2148,6 +2173,7 @@ static struct attribute *qgroups_attrs[] = {
>  	BTRFS_ATTR_PTR(qgroups, enabled),
>  	BTRFS_ATTR_PTR(qgroups, inconsistent),
>  	BTRFS_ATTR_PTR(qgroups, drop_subtree_threshold),
> +	BTRFS_ATTR_PTR(qgroups, mode),
>  	NULL
>  };
>  ATTRIBUTE_GROUPS(qgroups);
> -- 
> 2.41.0

* Re: [PATCH v5 04/18] btrfs: add simple_quota incompat feature to sysfs
  2023-07-27 22:12 ` [PATCH v5 04/18] btrfs: add simple_quota incompat feature to sysfs Boris Burkov
  2023-08-21 18:01   ` Josef Bacik
@ 2023-09-07 11:28   ` David Sterba
  2023-09-07 20:56     ` Boris Burkov
  1 sibling, 1 reply; 53+ messages in thread
From: David Sterba @ 2023-09-07 11:28 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:12:51PM -0700, Boris Burkov wrote:
> Add an entry in the features directory for the new incompat flag
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/sysfs.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index e53614753391..f62bba0068ca 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -291,6 +291,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(metadata_uuid, METADATA_UUID);
>  BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
>  BTRFS_FEAT_ATTR_COMPAT_RO(block_group_tree, BLOCK_GROUP_TREE);
>  BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
> +BTRFS_FEAT_ATTR_INCOMPAT(simple_quota, SIMPLE_QUOTA);

I'm not sure if you mentioned in the cover letter or if we had discussed
it before, but does this need to be a full incompat bit? I.e. no mount
on older kernels, compared to a COMPAT_RO which would allow
read-only mount.
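
To illustrate the difference, here is a simplified userspace sketch of
how the two kinds of feature bits are usually treated at mount time.
The types and names are hypothetical and this is not the actual mount
code:

```c
#include <stdint.h>

enum ex_mount_result { EX_MOUNT_RW, EX_MOUNT_RO_ONLY, EX_MOUNT_REFUSED };

struct ex_feature_sets {
	uint64_t compat_ro;
	uint64_t incompat;
};

/*
 * An unknown incompat bit refuses the mount entirely, an unknown
 * compat_ro bit still allows a read-only mount.
 */
static enum ex_mount_result
ex_check_features(const struct ex_feature_sets *on_disk,
		  const struct ex_feature_sets *supported)
{
	if (on_disk->incompat & ~supported->incompat)
		return EX_MOUNT_REFUSED;
	if (on_disk->compat_ro & ~supported->compat_ro)
		return EX_MOUNT_RO_ONLY;
	return EX_MOUNT_RW;
}
```

So the question is whether an older kernel could safely read such a
filesystem without understanding the new item.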

>  #ifdef CONFIG_BLK_DEV_ZONED
>  BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
>  #endif
> @@ -322,6 +323,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
>  	BTRFS_FEAT_ATTR_PTR(free_space_tree),
>  	BTRFS_FEAT_ATTR_PTR(raid1c34),
>  	BTRFS_FEAT_ATTR_PTR(block_group_tree),
> +	BTRFS_FEAT_ATTR_PTR(simple_quota),
>  #ifdef CONFIG_BLK_DEV_ZONED
>  	BTRFS_FEAT_ATTR_PTR(zoned),
>  #endif
> -- 
> 2.41.0

* Re: [PATCH v5 06/18] btrfs: create qgroup earlier in snapshot creation
  2023-07-27 22:12 ` [PATCH v5 06/18] btrfs: create qgroup earlier in snapshot creation Boris Burkov
  2023-08-21 18:02   ` Josef Bacik
@ 2023-09-07 11:41   ` David Sterba
  2023-09-08 22:50     ` Boris Burkov
  1 sibling, 1 reply; 53+ messages in thread
From: David Sterba @ 2023-09-07 11:41 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:12:53PM -0700, Boris Burkov wrote:
> Pull creating the qgroup earlier in the snapshot. This allows simple
> quotas qgroups to see all the metadata writes related to the snapshot
> being created and to be born with the root node accounted.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/qgroup.c      | 3 +++
>  fs/btrfs/transaction.c | 6 ++++++
>  2 files changed, 9 insertions(+)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 18f521716e8d..8e3a4ced3077 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -1672,6 +1672,9 @@ int btrfs_create_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid)
>  	struct btrfs_qgroup *qgroup;
>  	int ret = 0;
>  
> +	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_DISABLED)
> +		return 0;
> +
>  	mutex_lock(&fs_info->qgroup_ioctl_lock);
>  	if (!fs_info->quota_root) {
>  		ret = -ENOTCONN;
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 89ff15aa085f..25217888e897 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1722,6 +1722,12 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
>  	}
>  	btrfs_release_path(path);
>  
> +	ret = btrfs_create_qgroup(trans, objectid);
> +	if (ret) {
> +		btrfs_abort_transaction(trans, ret);

This adds an error case to the middle of a transaction commit.
Snapshots are created in two phases: first the ioctl adds the pending
structure, then the commit actually creates the snapshot. So the first
phase preallocates what's needed (the root_item and path) and should do
the same with the qgroups as much as possible.

Also check all the things that btrfs_create_qgroup() does, searches the
qgroup tree, adds the new item, takes the qgroup_ioctl_lock mutex, and
adds the sysfs entry (that does allocations under GFP_KERNEL).
If you really need to create the qgroup like that then it needs much
more care.

* Re: [PATCH v5 07/18] btrfs: function for recording simple quota deltas
  2023-07-27 22:12 ` [PATCH v5 07/18] btrfs: function for recording simple quota deltas Boris Burkov
  2023-08-21 18:04   ` Josef Bacik
@ 2023-09-07 11:46   ` David Sterba
  1 sibling, 0 replies; 53+ messages in thread
From: David Sterba @ 2023-09-07 11:46 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:12:54PM -0700, Boris Burkov wrote:
> Rather than re-computing shared/exclusive ownership based on backrefs
> and walking roots for implicit backrefs, simple quotas does an increment
> when creating an extent and a decrement when deleting it. Add the API
> for the extent item code to use to track those events.
> 
> Also add a helper function to make collecting parent qgroups in a ulist
> easier for functions like this.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/qgroup.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/qgroup.h | 11 ++++++-
>  2 files changed, 83 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 8e3a4ced3077..dedc532669f4 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -332,6 +332,35 @@ static int del_relation_rb(struct btrfs_fs_info *fs_info,
>  	return -ENOENT;
>  }
>  
> +static int qgroup_collect_parents(struct btrfs_qgroup *qgroup,
> +				  struct ulist *ul)
> +{
> +	struct ulist_iterator uiter;
> +	struct ulist_node *unode;
> +	struct btrfs_qgroup_list *glist;
> +	struct btrfs_qgroup *qg;
> +	int ret = 0;
> +
> +	ulist_reinit(ul);
> +	ret = ulist_add(ul, qgroup->qgroupid,
> +			qgroup_to_aux(qgroup), GFP_ATOMIC);

Qu has sent a series to get rid of the GFP_ATOMIC allocations when
processing qgroups, so this would be good to port to the qgroup
iterators as well, but it's a recent change and can be done later as an
optimization.

> +	if (ret < 0)
> +		goto out;
> +	ULIST_ITER_INIT(&uiter);
> +	while ((unode = ulist_next(ul, &uiter))) {
> +		qg = unode_aux_to_qgroup(unode);
> +		list_for_each_entry(glist, &qg->groups, next_group) {
> +			ret = ulist_add(ul, glist->group->qgroupid,
> +					qgroup_to_aux(glist->group), GFP_ATOMIC);
> +			if (ret < 0)
> +				goto out;
> +		}
> +	}
> +	ret = 0;
> +out:
> +	return ret;
> +}
> +
>  #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>  int btrfs_verify_qgroup_counts(struct btrfs_fs_info *fs_info, u64 qgroupid,
>  			       u64 rfer, u64 excl)
> @@ -4535,3 +4564,47 @@ void btrfs_qgroup_destroy_extent_records(struct btrfs_transaction *trans)
>  	}
>  	*root = RB_ROOT;
>  }
> +
> +int btrfs_record_simple_quota_delta(struct btrfs_fs_info *fs_info,

You can abbreviate all the 'simple_quota' in identifiers as 'squota'.

> +				    struct btrfs_simple_quota_delta *delta)
> +{
> +	int ret;
> +	struct ulist *ul = fs_info->qgroup_ulist;
> +	struct btrfs_qgroup *qgroup;
> +	struct ulist_iterator uiter;
> +	struct ulist_node *unode;
> +	struct btrfs_qgroup *qg;
> +	u64 root = delta->root;
> +	u64 num_bytes = delta->num_bytes;
> +	int sign = delta->is_inc ? 1 : -1;

	const int sign = (delta->is_inc ? 1 : -1);
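
Note that the update below multiplies a u64 by this int, so the
decrement relies on unsigned wrap-around. A small userspace sketch of
that arithmetic, with a hypothetical counters struct standing in for
the qgroup:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two qgroup counters touched by the delta. */
struct ex_counters {
	uint64_t rfer;
	uint64_t excl;
};

static void ex_apply_delta(struct ex_counters *c, uint64_t num_bytes, bool is_inc)
{
	const int sign = (is_inc ? 1 : -1);

	/*
	 * sign is converted to uint64_t for the multiplication, so -1
	 * becomes UINT64_MAX and the product wraps to -num_bytes modulo
	 * 2^64. Adding that to the counter subtracts num_bytes.
	 */
	c->excl += num_bytes * sign;
	c->rfer += num_bytes * sign;
}
```

The result is well defined because all the arithmetic happens on
unsigned 64-bit values.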

> +
> +	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_SIMPLE)
> +		return 0;
> +
> +	if (!is_fstree(root))
> +		return 0;
> +
> +	spin_lock(&fs_info->qgroup_lock);
> +	qgroup = find_qgroup_rb(fs_info, root);
> +	if (!qgroup) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +	ret = qgroup_collect_parents(qgroup, ul);
> +	if (ret)
> +		goto out;
> +
> +	ULIST_ITER_INIT(&uiter);
> +	while ((unode = ulist_next(ul, &uiter))) {
> +		qg = unode_aux_to_qgroup(unode);
> +		qg->excl += num_bytes * sign;
> +		qg->rfer += num_bytes * sign;
> +		qgroup_dirty(fs_info, qg);
> +	}
> +
> +out:
> +	spin_unlock(&fs_info->qgroup_lock);
> +	if (!ret && delta->rsv_bytes)
> +		btrfs_qgroup_free_refroot(fs_info, root, delta->rsv_bytes, BTRFS_QGROUP_RSV_DATA);
> +	return ret;
> +}
> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
> index d4c4d039585f..94d85b4fbebd 100644
> --- a/fs/btrfs/qgroup.h
> +++ b/fs/btrfs/qgroup.h
> @@ -235,6 +235,14 @@ struct btrfs_qgroup {
>  	struct kobject kobj;
>  };
>  
> +struct btrfs_simple_quota_delta {

struct btrfs_squota_delta

> +	u64 root; /* The fstree root this delta counts against */
> +	u64 num_bytes; /* The number of bytes in the extent being counted */
> +	u64 rsv_bytes; /* The number of bytes reserved for this extent */
> +	bool is_inc; /* Whether we are using or freeing the extent */
> +	bool is_data; /* Whether the extent is data or metadata */

Please put the comments on separate lines before the struct member definitions.
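
Something like the following, shown here as a userspace sketch with
the shorter struct name suggested above. The u64 typedef is only a
stand-in for the kernel type:

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's u64. */
typedef uint64_t u64;

struct btrfs_squota_delta {
	/* The fstree root this delta counts against. */
	u64 root;
	/* The number of bytes in the extent being counted. */
	u64 num_bytes;
	/* The number of bytes reserved for this extent. */
	u64 rsv_bytes;
	/* Whether we are using or freeing the extent. */
	bool is_inc;
	/* Whether the extent is data or metadata. */
	bool is_data;
};
```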

> +};
> +
>  static inline u64 btrfs_qgroup_subvolid(u64 qgroupid)
>  {
>  	return (qgroupid & ((1ULL << BTRFS_QGROUP_LEVEL_SHIFT) - 1));
> @@ -447,5 +455,6 @@ int btrfs_qgroup_trace_subtree_after_cow(struct btrfs_trans_handle *trans,
>  		struct btrfs_root *root, struct extent_buffer *eb);
>  void btrfs_qgroup_destroy_extent_records(struct btrfs_transaction *trans);
>  bool btrfs_check_quota_leak(struct btrfs_fs_info *fs_info);
> -
> +int btrfs_record_simple_quota_delta(struct btrfs_fs_info *fs_info,
> +				    struct btrfs_simple_quota_delta *delta);

Please keep the newline before the last #endif

>  #endif
> -- 
> 2.41.0

* Re: [PATCH v5 10/18] btrfs: track original extent owner in head_ref
  2023-07-27 22:12 ` [PATCH v5 10/18] btrfs: track original extent owner in head_ref Boris Burkov
  2023-08-21 18:06   ` Josef Bacik
@ 2023-09-07 11:54   ` David Sterba
  1 sibling, 0 replies; 53+ messages in thread
From: David Sterba @ 2023-09-07 11:54 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:12:57PM -0700, Boris Burkov wrote:
> Simple quotas requires tracking the original creating root of any given
> extent. This gets complicated when multiple subvolumes create
> overlapping/contradictory refs in the same transaction. For example,
> due to modifying or deleting an extent while also snapshotting it.
> 
> To resolve this in a general way, take advantage of the fact that we are
> essentially already tracking this for handling releasing reservations.
> The head ref coalesces the various refs and uses must_insert_reserved to
> check if it needs to create an extent/free reservation. Store the ref
> that set must_insert_reserved as the owning ref on the head ref.
> 
> Note that this can result in writing an extent for the very first time
> with an owner different from its only ref, but it will look the same as
> if you first created it with the original owning ref, then added the
> other ref, then removed the owning ref.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/delayed-ref.c | 20 ++++++++++++++++----
>  fs/btrfs/delayed-ref.h |  7 +++++++
>  2 files changed, 23 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index f0bae1e1c455..28ba7a9eb3c3 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -623,6 +623,16 @@ static noinline void update_existing_head_ref(struct btrfs_trans_handle *trans,
>  	BUG_ON(existing->is_data != update->is_data);
>  
>  	spin_lock(&existing->lock);
> +
> +	/*
> +	 * When freeing an extent, we may not know the owning root
> +	 * when we first create the head_ref. However, some deref before the
> +	 * last deref will know it, so we just need to update the head_ref
> +	 * accordingly

We now write comments as full sentences so please use a "." at the end.

> +	 */
> +	if (!existing->owning_root)
> +		existing->owning_root = update->owning_root;
> +
>  	if (update->must_insert_reserved) {
>  		/* if the extent was freed and then
>  		 * reallocated before the delayed ref
> @@ -632,6 +642,7 @@ static noinline void update_existing_head_ref(struct btrfs_trans_handle *trans,
>  		 * Set it again here
>  		 */
>  		existing->must_insert_reserved = update->must_insert_reserved;
> +		existing->owning_root = update->owning_root;
>  
>  		/*
>  		 * update the num_bytes so we make sure the accounting
> @@ -694,7 +705,7 @@ static void init_delayed_ref_head(struct btrfs_delayed_ref_head *head_ref,
>  				  struct btrfs_qgroup_extent_record *qrecord,
>  				  u64 bytenr, u64 num_bytes, u64 ref_root,
>  				  u64 reserved, int action, bool is_data,
> -				  bool is_system)
> +				  bool is_system, u64 owning_root)
>  {
>  	int count_mod = 1;
>  	bool must_insert_reserved = false;
> @@ -735,6 +746,7 @@ static void init_delayed_ref_head(struct btrfs_delayed_ref_head *head_ref,
>  	head_ref->num_bytes = num_bytes;
>  	head_ref->ref_mod = count_mod;
>  	head_ref->must_insert_reserved = must_insert_reserved;
> +	head_ref->owning_root = owning_root;
>  	head_ref->is_data = is_data;
>  	head_ref->is_system = is_system;
>  	head_ref->ref_tree = RB_ROOT_CACHED;
> @@ -922,7 +934,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
>  
>  	init_delayed_ref_head(head_ref, record, bytenr, num_bytes,
>  			      generic_ref->tree_ref.ref_root, 0, action,
> -			      false, is_system);
> +			      false, is_system, generic_ref->owning_root);
>  	head_ref->extent_op = extent_op;
>  
>  	delayed_refs = &trans->transaction->delayed_refs;
> @@ -1014,7 +1026,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
>  	}
>  
>  	init_delayed_ref_head(head_ref, record, bytenr, num_bytes, ref_root,
> -			      reserved, action, true, false);
> +			      reserved, action, true, false, generic_ref->owning_root);
>  	head_ref->extent_op = NULL;
>  
>  	delayed_refs = &trans->transaction->delayed_refs;
> @@ -1060,7 +1072,7 @@ int btrfs_add_delayed_extent_op(struct btrfs_trans_handle *trans,
>  		return -ENOMEM;
>  
>  	init_delayed_ref_head(head_ref, NULL, bytenr, num_bytes, 0, 0,
> -			      BTRFS_UPDATE_DELAYED_HEAD, false, false);
> +			      BTRFS_UPDATE_DELAYED_HEAD, false, false, 0);
>  	head_ref->extent_op = extent_op;
>  
>  	delayed_refs = &trans->transaction->delayed_refs;
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index 0729850a9193..71f0a6e5d583 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -117,6 +117,13 @@ struct btrfs_delayed_ref_head {
>  	 * the free has happened.
>  	 */
>  	bool must_insert_reserved;
> +
> +	/*
> +	 * The root which triggered the allocation when
> +	 * must_insert_reserved is true
> +	 */
> +	u64 owning_root;

Please reorder this so it's not among the bools; placed here it would
create a gap of 7 bytes before it.
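
The effect can be seen in a reduced userspace sketch. Both structs are
hypothetical and keep only the members around the new field; the 7
byte figure assumes a 1-byte bool and 8-byte alignment for u64, as on
common 64-bit ABIs:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Layout as in the patch: the 64-bit member sits between bools. */
struct ex_head_interleaved {
	bool must_insert_reserved;
	uint64_t owning_root;
	bool is_data;
	bool is_system;
	bool processing;
};

/* Reordered: the 64-bit member first, the bools packed together. */
struct ex_head_grouped {
	uint64_t owning_root;
	bool must_insert_reserved;
	bool is_data;
	bool is_system;
	bool processing;
};

/* Padding the compiler inserts between the first bool and owning_root. */
static size_t ex_gap_before_owning_root(void)
{
	return offsetof(struct ex_head_interleaved, owning_root) - sizeof(bool);
}
```

Comparing sizeof() of the two structs, or running pahole on the real
one, shows how much the reordering saves.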

> +
>  	bool is_data;
>  	bool is_system;
>  	bool processing;
> -- 
> 2.41.0

* Re: [PATCH v5 11/18] btrfs: new inline ref storing owning subvol of data extents
  2023-07-27 22:12 ` [PATCH v5 11/18] btrfs: new inline ref storing owning subvol of data extents Boris Burkov
  2023-08-21 18:07   ` Josef Bacik
@ 2023-09-07 12:06   ` David Sterba
  1 sibling, 0 replies; 53+ messages in thread
From: David Sterba @ 2023-09-07 12:06 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:12:58PM -0700, Boris Burkov wrote:
> In order to implement simple quota groups, we need to be able to
> associate a data extent with the subvolume that created it. Once you
> account for reflink, this information cannot be recovered without
> explicitly storing it. Options for storing it are:
> - a new key/item
> - a new extent inline ref item
> 
> The former is backwards compatible, but wastes space, the latter is
> incompat, but is efficient in space and reuses the existing inline ref
> machinery, while only abusing it a tiny amount -- specifically, the new
> item is not a ref, per-se.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/accessors.h            |  4 +++
>  fs/btrfs/backref.c              |  3 ++
>  fs/btrfs/extent-tree.c          | 56 ++++++++++++++++++++++++++-------
>  fs/btrfs/print-tree.c           | 12 +++++++
>  fs/btrfs/ref-verify.c           |  3 ++
>  fs/btrfs/tree-checker.c         |  3 ++
>  include/uapi/linux/btrfs_tree.h |  6 ++++
>  7 files changed, 76 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
> index 8cfc8214109c..a23045c05937 100644
> --- a/fs/btrfs/accessors.h
> +++ b/fs/btrfs/accessors.h
> @@ -349,6 +349,8 @@ BTRFS_SETGET_FUNCS(extent_data_ref_count, struct btrfs_extent_data_ref, count, 3
>  
>  BTRFS_SETGET_FUNCS(shared_data_ref_count, struct btrfs_shared_data_ref, count, 32);
>  
> +BTRFS_SETGET_FUNCS(extent_owner_ref_root_id, struct btrfs_extent_owner_ref, root_id, 64);
> +
>  BTRFS_SETGET_FUNCS(extent_inline_ref_type, struct btrfs_extent_inline_ref,
>  		   type, 8);
>  BTRFS_SETGET_FUNCS(extent_inline_ref_offset, struct btrfs_extent_inline_ref,
> @@ -365,6 +367,8 @@ static inline u32 btrfs_extent_inline_ref_size(int type)
>  	if (type == BTRFS_EXTENT_DATA_REF_KEY)
>  		return sizeof(struct btrfs_extent_data_ref) +
>  		       offsetof(struct btrfs_extent_inline_ref, offset);
> +	if (type == BTRFS_EXTENT_OWNER_REF_KEY)
> +		return sizeof(struct btrfs_extent_inline_ref);
>  	return 0;
>  }
>  
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index 79336fa853db..d5bb6a880713 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -1129,6 +1129,9 @@ static int add_inline_refs(struct btrfs_backref_walk_ctx *ctx,
>  						       count, sc, GFP_NOFS);
>  			break;
>  		}
> +		case BTRFS_EXTENT_OWNER_REF_KEY:
> +			WARN_ON(!btrfs_fs_incompat(ctx->fs_info, SIMPLE_QUOTA));

Please turn this into an ASSERT; we must catch this during development,
but the warning makes no sense for end users.

> +			break;
>  		default:
>  			WARN_ON(1);
>  		}
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 4f0115553cd3..c6d537bf5ad4 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -342,9 +342,13 @@ int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
>  				     struct btrfs_extent_inline_ref *iref,
>  				     enum btrfs_inline_ref_type is_data)
>  {
> +	struct btrfs_fs_info *fs_info = eb->fs_info;
>  	int type = btrfs_extent_inline_ref_type(eb, iref);
>  	u64 offset = btrfs_extent_inline_ref_offset(eb, iref);
>  
> +	if (type == BTRFS_EXTENT_OWNER_REF_KEY && btrfs_fs_incompat(fs_info, SIMPLE_QUOTA))

I think the conditions should be swapped, first you check that squotas
are enabled then that the type is the one you look for.

> +		return type;
> +
>  	if (type == BTRFS_TREE_BLOCK_REF_KEY ||
>  	    type == BTRFS_SHARED_BLOCK_REF_KEY ||
>  	    type == BTRFS_SHARED_DATA_REF_KEY ||
> @@ -353,26 +357,25 @@ int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
>  			if (type == BTRFS_TREE_BLOCK_REF_KEY)
>  				return type;
>  			if (type == BTRFS_SHARED_BLOCK_REF_KEY) {
> -				ASSERT(eb->fs_info);
> +				ASSERT(fs_info);

I'm not sure what the point of this assertion is, each eb is created
with a valid fs_info. It's not in your new code so you can keep it, but
it would be good to remove it eventually.

>  				/*
>  				 * Every shared one has parent tree block,
>  				 * which must be aligned to sector size.
>  				 */
> -				if (offset &&
> -				    IS_ALIGNED(offset, eb->fs_info->sectorsize))
> +				if (offset && IS_ALIGNED(offset, fs_info->sectorsize))
>  					return type;
>  			}
>  		} else if (is_data == BTRFS_REF_TYPE_DATA) {
>  			if (type == BTRFS_EXTENT_DATA_REF_KEY)
>  				return type;
>  			if (type == BTRFS_SHARED_DATA_REF_KEY) {
> -				ASSERT(eb->fs_info);
> +				ASSERT(fs_info);

Same.

>  				/*
>  				 * Every shared one has parent tree block,
>  				 * which must be aligned to sector size.
>  				 */
>  				if (offset &&
> -				    IS_ALIGNED(offset, eb->fs_info->sectorsize))
> +				    IS_ALIGNED(offset, fs_info->sectorsize))
>  					return type;
>  			}
>  		} else {
> @@ -382,7 +385,7 @@ int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
>  	}
>  
>  	btrfs_print_leaf(eb);
> -	btrfs_err(eb->fs_info,
> +	btrfs_err(fs_info,
>  		  "eb %llu iref 0x%lx invalid extent inline ref type %d",
>  		  eb->start, (unsigned long)iref, type);
>  	WARN_ON(1);
> @@ -891,6 +894,11 @@ int lookup_inline_extent_backref(struct btrfs_trans_handle *trans,
>  		}
>  		iref = (struct btrfs_extent_inline_ref *)ptr;
>  		type = btrfs_get_extent_inline_ref_type(leaf, iref, needed);
> +		if (type == BTRFS_EXTENT_OWNER_REF_KEY) {
> +			WARN_ON(!btrfs_fs_incompat(fs_info, SIMPLE_QUOTA));

ASSERT()

> +			ptr += btrfs_extent_inline_ref_size(type);
> +			continue;
> +		}
>  		if (type == BTRFS_REF_TYPE_INVALID) {
>  			err = -EUCLEAN;
>  			goto out;
> @@ -1684,6 +1692,8 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
>  		 node->type == BTRFS_SHARED_DATA_REF_KEY)
>  		ret = run_delayed_data_ref(trans, node, extent_op,
>  					   insert_reserved);
> +	else if (node->type == BTRFS_EXTENT_OWNER_REF_KEY)
> +		ret = 0;
>  	else
>  		BUG();
>  	if (ret && insert_reserved)
> @@ -2250,6 +2260,7 @@ static noinline int check_committed_ref(struct btrfs_root *root,
>  	struct btrfs_extent_item *ei;
>  	struct btrfs_key key;
>  	u32 item_size;
> +	u32 expected_size;
>  	int type;
>  	int ret;
>  
> @@ -2276,10 +2287,22 @@ static noinline int check_committed_ref(struct btrfs_root *root,
>  	ret = 1;
>  	item_size = btrfs_item_size(leaf, path->slots[0]);
>  	ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item);
> +	expected_size = sizeof(*ei) + btrfs_extent_inline_ref_size(BTRFS_EXTENT_DATA_REF_KEY);
> +
> +	/* No inline refs; we need to bail before checking for owner ref */
> +	if (item_size == sizeof(*ei))
> +		goto out;
> +
> +	/* Check for an owner ref; skip over it to the real inline refs */
> +	iref = (struct btrfs_extent_inline_ref *)(ei + 1);
> +	type = btrfs_get_extent_inline_ref_type(leaf, iref, BTRFS_REF_TYPE_DATA);
> +	if (btrfs_fs_incompat(fs_info, SIMPLE_QUOTA) && type == BTRFS_EXTENT_OWNER_REF_KEY) {
> +		expected_size += btrfs_extent_inline_ref_size(BTRFS_EXTENT_OWNER_REF_KEY);
> +		iref = (struct btrfs_extent_inline_ref *)(iref + 1);
> +	}
>  
>  	/* If extent item has more than 1 inline ref then it's shared */
> -	if (item_size != sizeof(*ei) +
> -	    btrfs_extent_inline_ref_size(BTRFS_EXTENT_DATA_REF_KEY))
> +	if (item_size != expected_size)
>  		goto out;
>  
>  	/*
> @@ -2291,8 +2314,6 @@ static noinline int check_committed_ref(struct btrfs_root *root,
>  	     btrfs_root_last_snapshot(&root->root_item)))
>  		goto out;
>  
> -	iref = (struct btrfs_extent_inline_ref *)(ei + 1);
> -
>  	/* If this extent has SHARED_DATA_REF then it's shared */
>  	type = btrfs_get_extent_inline_ref_type(leaf, iref, BTRFS_REF_TYPE_DATA);
>  	if (type != BTRFS_EXTENT_DATA_REF_KEY)
> @@ -4543,18 +4564,23 @@ static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
>  	struct btrfs_root *extent_root;
>  	int ret;
>  	struct btrfs_extent_item *extent_item;
> +	struct btrfs_extent_owner_ref *oref;
>  	struct btrfs_extent_inline_ref *iref;
>  	struct btrfs_path *path;
>  	struct extent_buffer *leaf;
>  	int type;
>  	u32 size;
> +	bool simple_quota = btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE;

	const bool simple_quota = (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE);

>  
>  	if (parent > 0)
>  		type = BTRFS_SHARED_DATA_REF_KEY;
>  	else
>  		type = BTRFS_EXTENT_DATA_REF_KEY;
>  
> -	size = sizeof(*extent_item) + btrfs_extent_inline_ref_size(type);
> +	size = sizeof(*extent_item);
> +	if (simple_quota)
> +		size += btrfs_extent_inline_ref_size(BTRFS_EXTENT_OWNER_REF_KEY);
> +	size += btrfs_extent_inline_ref_size(type);
>  
>  	path = btrfs_alloc_path();
>  	if (!path)
> @@ -4575,8 +4601,16 @@ static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
>  	btrfs_set_extent_flags(leaf, extent_item,
>  			       flags | BTRFS_EXTENT_FLAG_DATA);
>  
> +

Stray newline

>  	iref = (struct btrfs_extent_inline_ref *)(extent_item + 1);
> +	if (simple_quota) {
> +		btrfs_set_extent_inline_ref_type(leaf, iref, BTRFS_EXTENT_OWNER_REF_KEY);
> +		oref = (struct btrfs_extent_owner_ref *)(&iref->offset);
> +		btrfs_set_extent_owner_ref_root_id(leaf, oref, root_objectid);
> +		iref = (struct btrfs_extent_inline_ref *)(oref + 1);
> +	}
>  	btrfs_set_extent_inline_ref_type(leaf, iref, type);
> +
>  	if (parent > 0) {
>  		struct btrfs_shared_data_ref *ref;
>  		ref = (struct btrfs_shared_data_ref *)(iref + 1);
> diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c
> index aa06d9ca911d..3fac15ce0db0 100644
> --- a/fs/btrfs/print-tree.c
> +++ b/fs/btrfs/print-tree.c
> @@ -80,12 +80,20 @@ static void print_extent_data_ref(const struct extent_buffer *eb,
>  	       btrfs_extent_data_ref_count(eb, ref));
>  }
>  
> +static void print_extent_owner_ref(const struct extent_buffer *eb,
> +				   struct btrfs_extent_owner_ref *ref)

ref type can be also made const

> +{
> +	WARN_ON(!btrfs_fs_incompat(eb->fs_info, SIMPLE_QUOTA));

ASSERT()

> +	pr_cont("extent data owner root %llu\n", btrfs_extent_owner_ref_root_id(eb, ref));
> +}
> +
>  static void print_extent_item(const struct extent_buffer *eb, int slot, int type)
>  {
>  	struct btrfs_extent_item *ei;
>  	struct btrfs_extent_inline_ref *iref;
>  	struct btrfs_extent_data_ref *dref;
>  	struct btrfs_shared_data_ref *sref;
> +	struct btrfs_extent_owner_ref *oref;
>  	struct btrfs_disk_key key;
>  	unsigned long end;
>  	unsigned long ptr;
> @@ -159,6 +167,10 @@ static void print_extent_item(const struct extent_buffer *eb, int slot, int type
>  			"\t\t\t(parent %llu not aligned to sectorsize %u)\n",
>  				     offset, eb->fs_info->sectorsize);
>  			break;
> +		case BTRFS_EXTENT_OWNER_REF_KEY:
> +			oref = (struct btrfs_extent_owner_ref *)(&iref->offset);
> +			print_extent_owner_ref(eb, oref);
> +			break;
>  		default:
>  			pr_cont("(extent %llu has INVALID ref type %d)\n",
>  				  eb->start, type);
> diff --git a/fs/btrfs/ref-verify.c b/fs/btrfs/ref-verify.c
> index b7b3bd86f5e2..c0660233feb4 100644
> --- a/fs/btrfs/ref-verify.c
> +++ b/fs/btrfs/ref-verify.c
> @@ -485,6 +485,9 @@ static int process_extent_item(struct btrfs_fs_info *fs_info,
>  			ret = add_shared_data_ref(fs_info, offset, count,
>  						  key->objectid, key->offset);
>  			break;
> +		case BTRFS_EXTENT_OWNER_REF_KEY:
> +			WARN_ON(!btrfs_fs_incompat(fs_info, SIMPLE_QUOTA));
> +			break;
>  		default:
>  			btrfs_err(fs_info, "invalid key type in iref");
>  			ret = -EINVAL;
> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> index 038dfa8f1788..72d29ab74a01 100644
> --- a/fs/btrfs/tree-checker.c
> +++ b/fs/btrfs/tree-checker.c
> @@ -1451,6 +1451,9 @@ static int check_extent_item(struct extent_buffer *leaf,
>  			}
>  			inline_refs += btrfs_shared_data_ref_count(leaf, sref);
>  			break;
> +		case BTRFS_EXTENT_OWNER_REF_KEY:
> +			WARN_ON(!btrfs_fs_incompat(fs_info, SIMPLE_QUOTA));
> +			break;
>  		default:
>  			extent_err(leaf, slot, "unknown inline ref type: %u",
>  				   inline_type);
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index 47aca414a41b..eacb26caf3c6 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -226,6 +226,8 @@
>  
>  #define BTRFS_SHARED_DATA_REF_KEY	184
>  
> +#define BTRFS_EXTENT_OWNER_REF_KEY	190

Please add some description of the new key and use number 188 so it
leaves roughly equal space around it.
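
For illustration, a minimal sketch of how the definition could look with
both requests applied. The comment wording is only a suggestion, and the
value 192 for BTRFS_BLOCK_GROUP_ITEM_KEY is an assumption here because it
is not part of the quoted hunk:

```c
/* Existing neighbours; 192 is assumed, it is not in the quoted hunk. */
#define BTRFS_SHARED_DATA_REF_KEY	184
#define BTRFS_BLOCK_GROUP_ITEM_KEY	192

/*
 * Inline ref storing the id of the root that originally allocated the
 * extent. Simple quotas use it to find the owner when the extent is freed.
 * (Illustrative wording only.)
 */
#define BTRFS_EXTENT_OWNER_REF_KEY	188

/* 188 sits halfway between its two neighbours. */
_Static_assert(BTRFS_EXTENT_OWNER_REF_KEY - BTRFS_SHARED_DATA_REF_KEY ==
	       BTRFS_BLOCK_GROUP_ITEM_KEY - BTRFS_EXTENT_OWNER_REF_KEY,
	       "owner ref key should be centred between its neighbours");
```

With 184 below and 192 above, 188 is four away from each neighbour.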

> +
>  /*
>   * block groups give us hints into the extent allocation trees.  Which
>   * blocks are free etc etc
> @@ -783,6 +785,10 @@ struct btrfs_shared_data_ref {
>  	__le32 count;
>  } __attribute__ ((__packed__));
>  
> +struct btrfs_extent_owner_ref {
> +	__le64 root_id;
> +} __attribute__ ((__packed__));
> +
>  struct btrfs_extent_inline_ref {
>  	__u8 type;
>  	__le64 offset;
> -- 
> 2.41.0


* Re: [PATCH v5 12/18] btrfs: inline owner ref lookup helper
  2023-07-27 22:12 ` [PATCH v5 12/18] btrfs: inline owner ref lookup helper Boris Burkov
@ 2023-09-07 12:10   ` David Sterba
  0 siblings, 0 replies; 53+ messages in thread
From: David Sterba @ 2023-09-07 12:10 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:12:59PM -0700, Boris Burkov wrote:
> Inline ref parsing is a bit tricky and relies on a decent amount of
> implicit information, so I think it is beneficial to have a helper
> function for reading the owner ref, if only to "document" the format,
> along with the write path.
> 
> The main subtlety of note which I was missing by open-coding this was
> that it is important to check whether or not inline refs are present
> *at all*. i.e., if we are writing out a new extent under squotas, we
> will always use a big enough item for the inline ref and have it.
> However, it is possible that some random item predating squotas will not
> have any inline refs. In that case, trying to read the "type" field of
> the first inline ref will just be reading garbage in the form of
> whatever is in the next item.
> 
> This will be used by the extent free-ing path, which looks up data
> extent owners as well as a relocation path which needs to grab the owner
> before relocating an extent.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/extent-tree.c | 51 ++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/extent-tree.h |  3 +++
>  2 files changed, 54 insertions(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index c6d537bf5ad4..09fb321fa560 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2805,6 +2805,57 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
>  	return 0;
>  }
>  
> +/*
> + * Helper to parse an extent item's inline extents looking for a simple
> + * quotas owner ref.

      "Parse an extent item's ..."

> + *
> + * @fs_info  - the btrfs_fs_info for this mount

The argument description format is " @fs_info:"

> + * @leaf     - a leaf in the extent tree containing the extent item
> + * @slot     - the slot in the leaf where the extent item is found
> + *
> + * Returns the objectid of the root that originally allocated the extent item
> + * if the inline owner ref is expected and present, otherwise 0.
> + *
> + * If an extent item has an owner ref item, it will be the first
> + * inline ref item. Therefore the logic is to check whether there are
> + * any inline ref items, then check the type of the first one.

Please format comments so they fill the 80 columns; overflowing by 3-4
characters is acceptable if it makes the text look nicer.
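
As a rough user-space illustration of the logic the quoted comment
describes (check that any inline ref exists before reading the type of the
first one), here is a sketch over a raw byte buffer. The names, the 24-byte
header size and the key value 188 are assumptions of this model; it is not
the kernel helper itself:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values for this model only. */
#define MODEL_EXTENT_ITEM_SIZE	24	/* fixed header before any inline ref */
#define MODEL_OWNER_REF_KEY	188	/* type byte of an owner ref */

/*
 * item points to item_size bytes: a fixed header followed by zero or more
 * inline refs, each starting with a one-byte type and, for an owner ref,
 * a little-endian 64-bit root id.
 *
 * Returns the root id of a leading owner ref, otherwise 0.
 */
static uint64_t model_get_owner_root(const uint8_t *item, size_t item_size)
{
	const uint8_t *ref;
	uint64_t root = 0;
	int i;

	/* No inline refs of any kind, so there is no type byte to read. */
	if (item_size <= MODEL_EXTENT_ITEM_SIZE)
		return 0;

	/* Too short to hold a type byte plus a 64-bit payload. */
	if (item_size - MODEL_EXTENT_ITEM_SIZE < 1 + sizeof(uint64_t))
		return 0;

	/* Inline refs exist, but the first one is not an owner ref. */
	ref = item + MODEL_EXTENT_ITEM_SIZE;
	if (ref[0] != MODEL_OWNER_REF_KEY)
		return 0;

	/* Decode the little-endian root id that follows the type byte. */
	for (i = 7; i >= 0; i--)
		root = (root << 8) | ref[1 + i];

	return root;
}
```

The first test is the important one: without it, an item that ends right
after the header would have the next item's bytes interpreted as a ref type.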

> + *
> + */
> +u64 btrfs_get_extent_owner_root(struct btrfs_fs_info *fs_info,
> +				struct extent_buffer *leaf,
> +				int slot)
> +{
> +	struct btrfs_extent_item *ei;
> +	struct btrfs_extent_inline_ref *iref;
> +	struct btrfs_extent_owner_ref *oref;
> +	unsigned long ptr;
> +	unsigned long end;
> +	int type;
> +
> +	if (!btrfs_fs_incompat(fs_info, SIMPLE_QUOTA))
> +		return 0;
> +
> +	ei = btrfs_item_ptr(leaf, slot, struct btrfs_extent_item);
> +	ptr = (unsigned long)(ei + 1);
> +	end = (unsigned long)ei + btrfs_item_size(leaf, slot);
> +
> +	/* No inline ref items of any kind, can't check type */
> +	if (ptr == end)
> +		return 0;
> +
> +	iref = (struct btrfs_extent_inline_ref *)ptr;
> +	type = btrfs_get_extent_inline_ref_type(leaf, iref, BTRFS_REF_TYPE_ANY);
> +
> +	/* We found an owner ref, get the root out of it */
> +	if (type == BTRFS_EXTENT_OWNER_REF_KEY) {
> +		oref = (struct btrfs_extent_owner_ref *)(&iref->offset);
> +		return btrfs_extent_owner_ref_root_id(leaf, oref);
> +	}
> +
> +	/* We have inline refs, but not an owner ref */
> +	return 0;
> +}
> +
>  static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
>  				     u64 bytenr, u64 num_bytes, bool is_data)
>  {
> diff --git a/fs/btrfs/extent-tree.h b/fs/btrfs/extent-tree.h
> index b9e148adcd28..7c27652880a2 100644
> --- a/fs/btrfs/extent-tree.h
> +++ b/fs/btrfs/extent-tree.h
> @@ -141,6 +141,9 @@ int btrfs_set_disk_extent_flags(struct btrfs_trans_handle *trans,
>  				struct extent_buffer *eb, u64 flags);
>  int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_ref *ref);
>  
> +u64 btrfs_get_extent_owner_root(struct btrfs_fs_info *fs_info,
> +				struct extent_buffer *leaf,
> +				int slot);
>  int btrfs_free_reserved_extent(struct btrfs_fs_info *fs_info,
>  			       u64 start, u64 len, int delalloc);
>  int btrfs_pin_reserved_extent(struct btrfs_trans_handle *trans, u64 start, u64 len);
> -- 
> 2.41.0


* Re: [PATCH v5 13/18] btrfs: record simple quota deltas
  2023-07-27 22:13 ` [PATCH v5 13/18] btrfs: record simple quota deltas Boris Burkov
  2023-08-21 18:08   ` Josef Bacik
@ 2023-09-07 12:12   ` David Sterba
  1 sibling, 0 replies; 53+ messages in thread
From: David Sterba @ 2023-09-07 12:12 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:13:00PM -0700, Boris Burkov wrote:
> At the moment that we run delayed refs, we make the final ref-count
> based decision on creating/removing extent (and metadata) items.
> Therefore, it is exactly the spot to hook up simple quotas.
> 
> There are a few important subtleties to the fields we must collect to
> accurately track simple quotas, particularly when removing an extent.
> When removing a data extent, the ref could be in any tree (due to
> reflink, for example) and so we need to recover the owning root id from
> the owner ref item. When removing a metadata extent, we know the owning
> root from the owner field in the header when we create the delayed ref,
> so we can recover it from there.
> 
> We must also be careful to handle reservations properly to not leak
> reserved space. The happy path is freeing the reservation when the
> simple quota delta runs on a data extent. If that doesn't happen, due to
> refs canceling out or some error, the ref head already has the
> must_insert_reserved machinery to handle this, so we piggy back on that
> and use it to clean up the reserved data.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/delayed-ref.c |  1 +
>  fs/btrfs/delayed-ref.h |  6 +++
>  fs/btrfs/extent-tree.c | 85 +++++++++++++++++++++++++++++++++++++-----
>  3 files changed, 82 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 28ba7a9eb3c3..874c1853d9b1 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -745,6 +745,7 @@ static void init_delayed_ref_head(struct btrfs_delayed_ref_head *head_ref,
>  	head_ref->bytenr = bytenr;
>  	head_ref->num_bytes = num_bytes;
>  	head_ref->ref_mod = count_mod;
> +	head_ref->reserved_bytes = reserved;
>  	head_ref->must_insert_reserved = must_insert_reserved;
>  	head_ref->owning_root = owning_root;
>  	head_ref->is_data = is_data;
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index 71f0a6e5d583..221d400dd88f 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -104,6 +104,12 @@ struct btrfs_delayed_ref_head {
>  	 */
>  	int ref_mod;
>  
> +	/*
> +	 * Track reserved bytes when setting must_insert_reserved.
> +	 * On success or cleanup, we will need to free the reservation.
> +	 */
> +	u64 reserved_bytes;
> +
>  	/*
>  	 * when a new extent is allocated, it is just reserved in memory
>  	 * The actual extent isn't inserted into the extent allocation tree
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 09fb321fa560..1b5efd03ef83 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -47,6 +47,7 @@
>  
>  
>  static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
> +			       struct btrfs_delayed_ref_head *href,
>  			       struct btrfs_delayed_ref_node *node, u64 parent,
>  			       u64 root_objectid, u64 owner_objectid,
>  			       u64 owner_offset, int refs_to_drop,
> @@ -1482,6 +1483,7 @@ static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
>  }
>  
>  static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
> +				struct btrfs_delayed_ref_head *href,
>  				struct btrfs_delayed_ref_node *node,
>  				struct btrfs_delayed_extent_op *extent_op,
>  				bool insert_reserved)
> @@ -1505,18 +1507,28 @@ static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
>  	ref_root = ref->root;
>  
>  	if (node->action == BTRFS_ADD_DELAYED_REF && insert_reserved) {
> +		struct btrfs_simple_quota_delta delta = {
> +			.root = href->owning_root,
> +			.num_bytes = node->num_bytes,
> +			.rsv_bytes = href->reserved_bytes,
> +			.is_data = true,
> +			.is_inc	= true,
> +		};
> +
>  		if (extent_op)
>  			flags |= extent_op->flags_to_set;
>  		ret = alloc_reserved_file_extent(trans, parent, ref_root,
>  						 flags, ref->objectid,
>  						 ref->offset, &ins,
>  						 node->ref_mod);
> +		if (!ret)
> +			ret = btrfs_record_simple_quota_delta(trans->fs_info, &delta);
>  	} else if (node->action == BTRFS_ADD_DELAYED_REF) {
>  		ret = __btrfs_inc_extent_ref(trans, node, parent, ref_root,
>  					     ref->objectid, ref->offset,
>  					     node->ref_mod, extent_op);
>  	} else if (node->action == BTRFS_DROP_DELAYED_REF) {
> -		ret = __btrfs_free_extent(trans, node, parent,
> +		ret = __btrfs_free_extent(trans, href, node, parent,
>  					  ref_root, ref->objectid,
>  					  ref->offset, node->ref_mod,
>  					  extent_op);
> @@ -1632,11 +1644,13 @@ static int run_delayed_extent_op(struct btrfs_trans_handle *trans,
>  }
>  
>  static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
> +				struct btrfs_delayed_ref_head *href,
>  				struct btrfs_delayed_ref_node *node,
>  				struct btrfs_delayed_extent_op *extent_op,
>  				bool insert_reserved)
>  {
>  	int ret = 0;
> +	struct btrfs_fs_info *fs_info = trans->fs_info;
>  	struct btrfs_delayed_tree_ref *ref;
>  	u64 parent = 0;
>  	u64 ref_root = 0;
> @@ -1656,13 +1670,23 @@ static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
>  		return -EIO;
>  	}
>  	if (node->action == BTRFS_ADD_DELAYED_REF && insert_reserved) {
> +		struct btrfs_simple_quota_delta delta = {
> +			.root = href->owning_root,
> +			.num_bytes = fs_info->nodesize,
> +			.rsv_bytes = 0,
> +			.is_data = false,
> +			.is_inc = true,
> +		};
> +
>  		BUG_ON(!extent_op || !extent_op->update_flags);
>  		ret = alloc_reserved_tree_block(trans, node, extent_op);
> +		if (!ret)
> +			btrfs_record_simple_quota_delta(fs_info, &delta);
>  	} else if (node->action == BTRFS_ADD_DELAYED_REF) {
>  		ret = __btrfs_inc_extent_ref(trans, node, parent, ref_root,
>  					     ref->level, 0, 1, extent_op);
>  	} else if (node->action == BTRFS_DROP_DELAYED_REF) {
> -		ret = __btrfs_free_extent(trans, node, parent, ref_root,
> +		ret = __btrfs_free_extent(trans, href, node, parent, ref_root,
>  					  ref->level, 0, 1, extent_op);
>  	} else {
>  		BUG();
> @@ -1672,6 +1696,7 @@ static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
>  
>  /* helper function to actually process a single delayed ref entry */
>  static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
> +			       struct btrfs_delayed_ref_head *href,
>  			       struct btrfs_delayed_ref_node *node,
>  			       struct btrfs_delayed_extent_op *extent_op,
>  			       bool insert_reserved)
> @@ -1686,12 +1711,12 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
>  
>  	if (node->type == BTRFS_TREE_BLOCK_REF_KEY ||
>  	    node->type == BTRFS_SHARED_BLOCK_REF_KEY)
> -		ret = run_delayed_tree_ref(trans, node, extent_op,
> +		ret = run_delayed_tree_ref(trans, href, node, extent_op,
>  					   insert_reserved);
>  	else if (node->type == BTRFS_EXTENT_DATA_REF_KEY ||
>  		 node->type == BTRFS_SHARED_DATA_REF_KEY)
> -		ret = run_delayed_data_ref(trans, node, extent_op,
> -					   insert_reserved);
> +		ret = run_delayed_data_ref(trans, href, node,
> +					   extent_op, insert_reserved);
>  	else if (node->type == BTRFS_EXTENT_OWNER_REF_KEY)
>  		ret = 0;
>  	else
> @@ -1788,6 +1813,11 @@ void btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info,
>  		spin_unlock(&delayed_refs->lock);
>  		nr_items += btrfs_csum_bytes_to_leaves(fs_info, head->num_bytes);
>  	}
> +	if (head->must_insert_reserved && head->is_data &&
> +	    btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)

The check for qgroup mode should be the first condition.
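
A sketch of the reordered condition, using hypothetical stand-in types so
it compiles on its own. One plausible motivation, assumed here rather than
stated above, is that with the mode test first the head's fields are not
examined unless simple quotas are enabled:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel types, reduced to what is needed. */
enum model_qgroup_mode {
	MODEL_MODE_DISABLED,
	MODEL_MODE_FULL,
	MODEL_MODE_SIMPLE,
};

struct model_ref_head {
	bool must_insert_reserved;
	bool is_data;
	uint64_t reserved_bytes;
};

/*
 * Return how many reserved data bytes the cleanup path should hand back.
 * The mode test comes first, so && short-circuits before head is touched
 * whenever simple quotas are not in use.
 */
static uint64_t model_cleanup_rsv(enum model_qgroup_mode mode,
				  const struct model_ref_head *head)
{
	if (mode == MODEL_MODE_SIMPLE &&
	    head->must_insert_reserved && head->is_data)
		return head->reserved_bytes;

	return 0;
}
```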

> +		btrfs_qgroup_free_refroot(fs_info, head->owning_root,
> +					  head->reserved_bytes,
> +					  BTRFS_QGROUP_RSV_DATA);
>  
>  	btrfs_delayed_refs_rsv_release(fs_info, nr_items);
>  }
> @@ -1934,8 +1964,8 @@ static int btrfs_run_delayed_refs_for_head(struct btrfs_trans_handle *trans,
>  		locked_ref->extent_op = NULL;
>  		spin_unlock(&locked_ref->lock);
>  
> -		ret = run_one_delayed_ref(trans, ref, extent_op,
> -					  must_insert_reserved);
> +		ret = run_one_delayed_ref(trans, locked_ref, ref,
> +					  extent_op, must_insert_reserved);
>  
>  		btrfs_free_delayed_extent_op(extent_op);
>  		if (ret) {
> @@ -2857,11 +2887,12 @@ u64 btrfs_get_extent_owner_root(struct btrfs_fs_info *fs_info,
>  }
>  
>  static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
> -				     u64 bytenr, u64 num_bytes, bool is_data)
> +				     u64 bytenr, struct btrfs_simple_quota_delta *delta)
>  {
>  	int ret;
> +	u64 num_bytes = delta->num_bytes;
>  
> -	if (is_data) {
> +	if (delta->is_data) {
>  		struct btrfs_root *csum_root;
>  
>  		csum_root = btrfs_csum_root(trans->fs_info, bytenr);
> @@ -2872,6 +2903,12 @@ static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
>  		}
>  	}
>  
> +	ret = btrfs_record_simple_quota_delta(trans->fs_info, delta);
> +	if (ret) {
> +		btrfs_abort_transaction(trans, ret);
> +		return ret;
> +	}
> +
>  	ret = add_to_free_space_tree(trans, bytenr, num_bytes);
>  	if (ret) {
>  		btrfs_abort_transaction(trans, ret);
> @@ -2952,6 +2989,7 @@ static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
>   * And that (13631488 EXTENT_DATA_REF <HASH>) gets removed.
>   */
>  static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
> +			       struct btrfs_delayed_ref_head *href,
>  			       struct btrfs_delayed_ref_node *node, u64 parent,
>  			       u64 root_objectid, u64 owner_objectid,
>  			       u64 owner_offset, int refs_to_drop,
> @@ -2974,6 +3012,7 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
>  	u64 bytenr = node->bytenr;
>  	u64 num_bytes = node->num_bytes;
>  	bool skinny_metadata = btrfs_fs_incompat(info, SKINNY_METADATA);
> +	u64 delayed_ref_root = href->owning_root;
>  
>  	extent_root = btrfs_extent_root(info, bytenr);
>  	ASSERT(extent_root);
> @@ -3172,6 +3211,14 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
>  			}
>  		}
>  	} else {
> +		struct btrfs_simple_quota_delta delta = {
> +			.root = delayed_ref_root,
> +			.num_bytes = num_bytes,
> +			.rsv_bytes = 0,
> +			.is_data = is_data,
> +			.is_inc = false,
> +		};
> +
>  		/* In this branch refs == 1 */
>  		if (found_extent) {
>  			if (is_data && refs_to_drop !=
> @@ -3210,6 +3257,16 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
>  				num_to_del = 2;
>  			}
>  		}
> +		/*
> +		 * We can't infer the data owner from the delayed ref, so we
> +		 * need to try to get it from the owning ref item.
> +		 *
> +		 * If it is not present, then that extent was not written under
> +		 * simple quotas mode, so we don't need to account for its
> +		 * deletion.
> +		 */
> +		if (is_data)
> +			delta.root = btrfs_get_extent_owner_root(trans->fs_info, leaf, extent_slot);
>  
>  		ret = btrfs_del_items(trans, extent_root, path, path->slots[0],
>  				      num_to_del);
> @@ -3219,7 +3276,7 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
>  		}
>  		btrfs_release_path(path);
>  
> -		ret = do_free_extent_accounting(trans, bytenr, num_bytes, is_data);
> +		ret = do_free_extent_accounting(trans, bytenr, &delta);
>  	}
>  	btrfs_release_path(path);
>  
> @@ -4790,6 +4847,13 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
>  	int ret;
>  	struct btrfs_block_group *block_group;
>  	struct btrfs_space_info *space_info;
> +	struct btrfs_simple_quota_delta delta = {
> +		.root = root_objectid,
> +		.num_bytes = ins->offset,
> +		.rsv_bytes = 0,
> +		.is_data = true,
> +		.is_inc = true,
> +	};
>  
>  	/*
>  	 * Mixed block groups will exclude before processing the log so we only
> @@ -4818,6 +4882,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
>  					 offset, ins, 1);
>  	if (ret)
>  		btrfs_pin_extent(trans, ins->objectid, ins->offset, 1);
> +	ret = btrfs_record_simple_quota_delta(fs_info, &delta);
>  	btrfs_put_block_group(block_group);
>  	return ret;
>  }
> -- 
> 2.41.0


* Re: [PATCH v5 14/18] btrfs: simple quota auto hierarchy for nested subvols
  2023-07-27 22:13 ` [PATCH v5 14/18] btrfs: simple quota auto hierarchy for nested subvols Boris Burkov
  2023-08-21 18:10   ` Josef Bacik
@ 2023-09-07 12:16   ` David Sterba
  1 sibling, 0 replies; 53+ messages in thread
From: David Sterba @ 2023-09-07 12:16 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:13:01PM -0700, Boris Burkov wrote:
> Consider the following sequence:
> - enable quotas
> - create subvol S id 256 at dir outer/
> - create a qgroup 1/100
> - add 0/256 (S's auto qgroup) to 1/100
> - create subvol T id 257 at dir outer/inner/
> 
> With full qgroups, there is no relationship between 0/257 and either of
> 0/256 or 1/100. There is an inherit feature that the creator of inner/
> can use to specify it ought to be in 1/100.
> 
> Simple quotas are targeted at container isolation, where such automatic
> inheritance for not necessarily trusted/controlled nested subvol
> creation would be quite helpful. Therefore, add a new default behavior
> for simple quotas: when you create a nested subvol, automatically
> inherit as parents any parents of the qgroup of the subvol the new inode
> is going in.
> 
> In our example, 0/257 would also be under 1/100, allowing easy control
> of a total quota over an arbitrary hierarchy of subvolumes.
> 
> I think this _might_ be a generally useful behavior, so it could be
> interesting to put it behind a new inheritance flag that simple quotas
> always use while traditional quotas let the user specify, but this is a
> minimally intrusive change to start.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/ioctl.c       |  2 +-
>  fs/btrfs/qgroup.c      | 44 +++++++++++++++++++++++++++++++++++++++---
>  fs/btrfs/qgroup.h      |  6 +++---
>  fs/btrfs/transaction.c | 13 +++++++++----
>  4 files changed, 54 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 9b61bc62e439..c9b069077fd0 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -652,7 +652,7 @@ static noinline int create_subvol(struct mnt_idmap *idmap,
>  	/* Tree log can't currently deal with an inode which is a new root. */
>  	btrfs_set_log_full_commit(trans);
>  
> -	ret = btrfs_qgroup_inherit(trans, 0, objectid, inherit);
> +	ret = btrfs_qgroup_inherit(trans, 0, objectid, root->root_key.objectid, inherit);
>  	if (ret)
>  		goto out;
>  
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index dedc532669f4..58e9ed0deedd 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -1550,8 +1550,7 @@ static int quick_update_accounting(struct btrfs_fs_info *fs_info,
>  	return ret;
>  }
>  
> -int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src,
> -			      u64 dst)
> +int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src, u64 dst)
>  {
>  	struct btrfs_fs_info *fs_info = trans->fs_info;
>  	struct btrfs_qgroup *parent;
> @@ -2991,6 +2990,40 @@ int btrfs_run_qgroups(struct btrfs_trans_handle *trans)
>  	return ret;
>  }
>  
> +static int qgroup_auto_inherit(struct btrfs_fs_info *fs_info,
> +			       u64 inode_rootid,
> +			       struct btrfs_qgroup_inherit **inherit)
> +{
> +	int i = 0;
> +	u64 num_qgroups = 0;
> +	struct btrfs_qgroup *inode_qg;
> +	struct btrfs_qgroup_list *qg_list;
> +
> +	if (*inherit)
> +		return -EEXIST;
> +
> +	inode_qg = find_qgroup_rb(fs_info, inode_rootid);
> +	if (!inode_qg)
> +		return -ENOENT;
> +
> +	num_qgroups = list_count_nodes(&inode_qg->groups);
> +
> +	if (!num_qgroups)
> +		return 0;
> +
> +	*inherit = kzalloc(sizeof(**inherit) + num_qgroups * sizeof(u64), GFP_NOFS);

There's kcalloc, which checks for potential multiplication overflow.

> +	if (!*inherit)
> +		return -ENOMEM;
> +	(*inherit)->num_qgroups = num_qgroups;
> +
> +	list_for_each_entry(qg_list, &inode_qg->groups, next_group) {
> +		u64 qg_id = qg_list->group->qgroupid;
> +		*((u64 *)((*inherit)+1) + i) = qg_id;

What does this do?

> +	}

Instead of reusing the *inherit for operations, please add a local
variable and at the end assign it to *inherit.
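
For reference, the quoted expression *((u64 *)((*inherit)+1) + i) writes to
the i-th u64 slot located directly after the btrfs_qgroup_inherit header,
which is where the list of qgroup ids is stored. As quoted, the loop never
advances i, so every iteration would store into slot 0. A user-space sketch
of the suggested shape (build into a local variable, assign to the out
parameter at the end) follows. struct model_inherit and
model_auto_inherit() are hypothetical stand-ins, and calloc() with an
explicit overflow check stands in for the kernel allocator:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down stand-in for struct btrfs_qgroup_inherit. */
struct model_inherit {
	uint64_t num_qgroups;
	uint64_t qgroups[];	/* ids stored right after the header */
};

/*
 * Build an inherit record listing the given parent qgroup ids.
 * Returns 0 and leaves *inherit untouched when there are no parents,
 * -1 on size overflow or allocation failure.
 */
static int model_auto_inherit(const uint64_t *parents, size_t num,
			      struct model_inherit **inherit)
{
	struct model_inherit *res;
	size_t i;

	if (num == 0)
		return 0;

	/* Reject counts where header + num * sizeof(u64) would overflow. */
	if (num > (SIZE_MAX - sizeof(*res)) / sizeof(res->qgroups[0]))
		return -1;

	res = calloc(1, sizeof(*res) + num * sizeof(res->qgroups[0]));
	if (!res)
		return -1;

	res->num_qgroups = num;
	/* Named array access instead of pointer arithmetic past the header. */
	for (i = 0; i < num; i++)
		res->qgroups[i] = parents[i];

	/* Publish only once the record is fully built. */
	*inherit = res;
	return 0;
}
```

Working on the local pointer keeps the caller's pointer unchanged on every
error path, and the indexed store makes the slot being written explicit.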

> +
> +	return 0;
> +}
> +
>  /*
>   * Copy the accounting information between qgroups. This is necessary
>   * when a snapshot or a subvolume is created. Throwing an error will
> @@ -2998,7 +3031,8 @@ int btrfs_run_qgroups(struct btrfs_trans_handle *trans)
>   * when a readonly fs is a reasonable outcome.
>   */
>  int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
> -			 u64 objectid, struct btrfs_qgroup_inherit *inherit)
> +			 u64 objectid, u64 inode_rootid,
> +			 struct btrfs_qgroup_inherit *inherit)
>  {
>  	int ret = 0;
>  	int i;
> @@ -3040,6 +3074,9 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
>  		goto out;
>  	}
>  
> +	if (!inherit && btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)

Swap the conditions please

> +		qgroup_auto_inherit(fs_info, inode_rootid, &inherit);
> +
>  	if (inherit) {
>  		i_qgroups = (u64 *)(inherit + 1);
>  		nums = inherit->num_qgroups + 2 * inherit->num_ref_copies +
> @@ -3066,6 +3103,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
>  	if (ret)
>  		goto out;
>  
> +

Stray newline

>  	/*
>  	 * add qgroup to all inherited groups
>  	 */
> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
> index 94d85b4fbebd..ce6fa8694ca7 100644
> --- a/fs/btrfs/qgroup.h
> +++ b/fs/btrfs/qgroup.h
> @@ -271,8 +271,7 @@ int btrfs_qgroup_rescan(struct btrfs_fs_info *fs_info);
>  void btrfs_qgroup_rescan_resume(struct btrfs_fs_info *fs_info);
>  int btrfs_qgroup_wait_for_completion(struct btrfs_fs_info *fs_info,
>  				     bool interruptible);
> -int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src,
> -			      u64 dst);
> +int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src, u64 dst);
>  int btrfs_del_qgroup_relation(struct btrfs_trans_handle *trans, u64 src,
>  			      u64 dst);
>  int btrfs_create_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid);
> @@ -366,7 +365,8 @@ int btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans, u64 bytenr,
>  int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans);
>  int btrfs_run_qgroups(struct btrfs_trans_handle *trans);
>  int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
> -			 u64 objectid, struct btrfs_qgroup_inherit *inherit);
> +			 u64 objectid, u64 inode_rootid,
> +			 struct btrfs_qgroup_inherit *inherit);
>  void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info,
>  			       u64 ref_root, u64 num_bytes,
>  			       enum btrfs_qgroup_rsv_type type);
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 25217888e897..fb857147df57 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1529,13 +1529,14 @@ static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
>  	int ret;
>  
>  	/*
> -	 * Save some performance in the case that full qgroups are not
> +	 * Save some performance in the case that qgroups are not
>  	 * enabled. If this check races with the ioctl, rescan will
>  	 * kick in anyway.
>  	 */
>  	if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_FULL)
>  		return 0;
>  
> +

Stray newline again.

>  	/*
>  	 * Ensure dirty @src will be committed.  Or, after coming
>  	 * commit_fs_roots() and switch_commit_roots(), any dirty but not
> @@ -1572,7 +1573,7 @@ static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
>  
>  	/* Now qgroup are all updated, we can inherit it to new qgroups */
>  	ret = btrfs_qgroup_inherit(trans, src->root_key.objectid, dst_objectid,
> -				   inherit);
> +				   parent->root_key.objectid, inherit);
>  	if (ret < 0)
>  		goto out;
>  
> @@ -1839,8 +1840,12 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
>  	 * To co-operate with that hack, we do hack again.
>  	 * Or snapshot will be greatly slowed down by a subtree qgroup rescan
>  	 */
> -	ret = qgroup_account_snapshot(trans, root, parent_root,
> -				      pending->inherit, objectid);
> +	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_FULL)
> +		ret = qgroup_account_snapshot(trans, root, parent_root,
> +					      pending->inherit, objectid);
> +	else if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE)
> +		ret = btrfs_qgroup_inherit(trans, root->root_key.objectid, objectid,
> +					   parent_root->root_key.objectid, pending->inherit);
>  	if (ret < 0)
>  		goto fail;
>  
> -- 
> 2.41.0


* Re: [PATCH v5 15/18] btrfs: check generation when recording simple quota delta
  2023-07-27 22:13 ` [PATCH v5 15/18] btrfs: check generation when recording simple quota delta Boris Burkov
  2023-08-21 18:11   ` Josef Bacik
@ 2023-09-07 12:24   ` David Sterba
  2023-09-08 21:41     ` Boris Burkov
  1 sibling, 1 reply; 53+ messages in thread
From: David Sterba @ 2023-09-07 12:24 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:13:02PM -0700, Boris Burkov wrote:
> Simple quotas count extents only from the moment the feature is enabled.
> Therefore, if we do something like:
> 1. create subvol S
> 2. write F in S
> 3. enable quotas
> 4. remove F
> 5. write G in S
> 
> then after 3. and 4. we would expect the simple quota usage of S to be 0
> (putting aside some metadata extents that might be written) and after
> 5., it should be the size of G plus metadata. Therefore, we need to be
> able to determine whether a particular quota delta we are processing
> predates simple quota enablement.
> 
> To do this, store the transaction id when quotas were enabled, both in
> fs_info for immediate use and in the quota status item to make it
> recoverable on mount. When we see a delta, check if the generation of
> the extent item is less than that of quota enablement. If so, we should
> ignore the delta from this extent.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/accessors.h            |  2 ++
>  fs/btrfs/extent-tree.c          |  4 ++++
>  fs/btrfs/fs.h                   |  2 ++
>  fs/btrfs/qgroup.c               | 14 ++++++++++++--
>  fs/btrfs/qgroup.h               |  1 +
>  include/uapi/linux/btrfs_tree.h |  7 +++++++
>  6 files changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
> index a23045c05937..513f8edbd98e 100644
> --- a/fs/btrfs/accessors.h
> +++ b/fs/btrfs/accessors.h
> @@ -970,6 +970,8 @@ BTRFS_SETGET_FUNCS(qgroup_status_flags, struct btrfs_qgroup_status_item,
>  		   flags, 64);
>  BTRFS_SETGET_FUNCS(qgroup_status_rescan, struct btrfs_qgroup_status_item,
>  		   rescan, 64);
> +BTRFS_SETGET_FUNCS(qgroup_status_enable_gen, struct btrfs_qgroup_status_item,
> +		   enable_gen, 64);
>  
>  /* btrfs_qgroup_info_item */
>  BTRFS_SETGET_FUNCS(qgroup_info_generation, struct btrfs_qgroup_info_item,
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 1b5efd03ef83..395ab46e520b 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -1513,6 +1513,7 @@ static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
>  			.rsv_bytes = href->reserved_bytes,
>  			.is_data = true,
>  			.is_inc	= true,
> +			.generation = trans->transid,
>  		};
>  
>  		if (extent_op)
> @@ -1676,6 +1677,7 @@ static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
>  			.rsv_bytes = 0,
>  			.is_data = false,
>  			.is_inc = true,
> +			.generation = trans->transid,
>  		};
>  
>  		BUG_ON(!extent_op || !extent_op->update_flags);
> @@ -3217,6 +3219,7 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
>  			.rsv_bytes = 0,
>  			.is_data = is_data,
>  			.is_inc = false,
> +			.generation = btrfs_extent_generation(leaf, ei),
>  		};
>  
>  		/* In this branch refs == 1 */
> @@ -4850,6 +4853,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
>  	struct btrfs_simple_quota_delta delta = {
>  		.root = root_objectid,
>  		.num_bytes = ins->offset,
> +		.generation = trans->transid,
>  		.rsv_bytes = 0,
>  		.is_data = true,
>  		.is_inc = true,
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index f76f450c2abf..da7b623ff15f 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -802,6 +802,8 @@ struct btrfs_fs_info {
>  	spinlock_t eb_leak_lock;
>  	struct list_head allocated_ebs;
>  #endif
> +
> +	u64 quota_enable_gen;

Please move it next to the other quota/qgroup related members; at the end
of fs_info there's only debugging stuff.

>  };
>  
>  static inline void btrfs_set_last_root_drop_gen(struct btrfs_fs_info *fs_info,
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 58e9ed0deedd..a8a603242431 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -454,6 +454,8 @@ int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info)
>  			}
>  			fs_info->qgroup_flags = btrfs_qgroup_status_flags(l, ptr);
>  			simple = fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_SIMPLE;
> +			if (simple)
> +				fs_info->quota_enable_gen = btrfs_qgroup_status_enable_gen(l, ptr);
>  			if (btrfs_qgroup_status_generation(l, ptr) !=
>  			    fs_info->generation && !simple) {
>  				qgroup_mark_inconsistent(fs_info);
> @@ -1107,10 +1109,12 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info,
>  	btrfs_set_qgroup_status_generation(leaf, ptr, trans->transid);
>  	btrfs_set_qgroup_status_version(leaf, ptr, BTRFS_QGROUP_STATUS_VERSION);
>  	fs_info->qgroup_flags = BTRFS_QGROUP_STATUS_FLAG_ON;
> -	if (simple)
> +	if (simple) {
>  		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_SIMPLE;
> -	else
> +		btrfs_set_qgroup_status_enable_gen(leaf, ptr, trans->transid);
> +	} else {
>  		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
> +	}
>  	btrfs_set_qgroup_status_flags(leaf, ptr, fs_info->qgroup_flags &
>  				      BTRFS_QGROUP_STATUS_FLAGS_MASK);
>  	btrfs_set_qgroup_status_rescan(leaf, ptr, 0);
> @@ -1202,6 +1206,8 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info,
>  		goto out_free_path;
>  	}
>  
> +	fs_info->quota_enable_gen = trans->transid;
> +
>  	mutex_unlock(&fs_info->qgroup_ioctl_lock);
>  	/*
>  	 * Commit the transaction while not holding qgroup_ioctl_lock, to avoid
> @@ -4622,6 +4628,10 @@ int btrfs_record_simple_quota_delta(struct btrfs_fs_info *fs_info,
>  	if (!is_fstree(root))
>  		return 0;
>  
> +	/* If the extent predates enabling quotas, don't count it. */
> +	if (delta->generation < fs_info->quota_enable_gen)
> +		return 0;
> +
>  	spin_lock(&fs_info->qgroup_lock);
>  	qgroup = find_qgroup_rb(fs_info, root);
>  	if (!qgroup) {
> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
> index ce6fa8694ca7..ae1ce14b365c 100644
> --- a/fs/btrfs/qgroup.h
> +++ b/fs/btrfs/qgroup.h
> @@ -241,6 +241,7 @@ struct btrfs_simple_quota_delta {
>  	u64 rsv_bytes; /* The number of bytes reserved for this extent */
>  	bool is_inc; /* Whether we are using or freeing the extent */
>  	bool is_data; /* Whether the extent is data or metadata */
> +	u64 generation; /* The generation the extent was created in */

Please reorder it so it does not leave gaps between struct members.
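
To illustrate the layout difference, here is a minimal userspace sketch.
It assumes the member list visible in this patch (root, num_bytes,
rsv_bytes, is_inc, is_data, generation) and a typical ABI where a u64
needs stricter alignment than a bool. The struct and helper names are
made up for the example:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Layout as posted: two bools followed by a u64 forces a hole before it. */
struct delta_gapped {
	uint64_t root;
	uint64_t num_bytes;
	uint64_t rsv_bytes;
	bool is_inc;
	bool is_data;
	uint64_t generation;
};

/* Reordered: all u64 members first, so any padding is tail padding only. */
struct delta_reordered {
	uint64_t root;
	uint64_t num_bytes;
	uint64_t rsv_bytes;
	uint64_t generation;
	bool is_inc;
	bool is_data;
};

/* Padding bytes between is_data and generation in the posted layout. */
static inline size_t gapped_hole(void)
{
	return offsetof(struct delta_gapped, generation) -
	       (offsetof(struct delta_gapped, is_data) + sizeof(bool));
}

/* Padding bytes between generation and is_inc after reordering. */
static inline size_t reordered_hole(void)
{
	return offsetof(struct delta_reordered, is_inc) -
	       (offsetof(struct delta_reordered, generation) + sizeof(uint64_t));
}
```

On common 64-bit ABIs both structs will likely end up the same size,
because the reordered one still carries tail padding; the reorder removes
the hole between members rather than shrinking the struct. A tool such as
pahole can confirm the real layout.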

>  };
>  
>  static inline u64 btrfs_qgroup_subvolid(u64 qgroupid)
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index eacb26caf3c6..1120ce3dae42 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -1242,6 +1242,13 @@ struct btrfs_qgroup_status_item {
>  	 * of the scan. It contains a logical address
>  	 */
>  	__le64 rescan;
> +
> +	/*
> +	 * the generation when quotas are enabled. Used by simple quotas to
> +	 * avoid decrementing when freeing an extent that was written before
> +	 * enable.
> +	 */
> +	__le64 enable_gen;

This is a public interface, and btrfs_qgroup_status_item is used in many
places in user space, at least in btrfs-progs. This needs a lot of
sanity checks.
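
One possible shape for such a check, as a userspace sketch and not
actual kernel or btrfs-progs code: accept the item only if its size
matches one of the two known layouts, and reject a short item that
claims to be in simple quota mode, since it cannot hold enable_gen. The
helper name, the flag value and the use of plain uint64_t instead of
__le64 are assumptions for the example:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag value, for illustration only. */
#define EX_STATUS_FLAG_SIMPLE (1ULL << 3)

/* Layout before this patch (field names follow the accessors in the patch). */
struct ex_status_item_old {
	uint64_t version;
	uint64_t generation;
	uint64_t flags;
	uint64_t rescan;
} __attribute__((__packed__));

/* Layout with the new trailing field. */
struct ex_status_item_new {
	uint64_t version;
	uint64_t generation;
	uint64_t flags;
	uint64_t rescan;
	uint64_t enable_gen;
} __attribute__((__packed__));

/* Return true if item_size is plausible for the given status flags. */
static bool ex_status_item_size_ok(size_t item_size, uint64_t flags)
{
	bool simple = (flags & EX_STATUS_FLAG_SIMPLE) != 0;

	if (item_size == sizeof(struct ex_status_item_new))
		return true;
	if (item_size == sizeof(struct ex_status_item_old))
		return !simple; /* a short item has no room for enable_gen */
	return false;
}
```

Whether a long item without the simple flag should be accepted is a
policy choice; the sketch allows it on the assumption that a newer
kernel may write the longer item in either mode.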

>  } __attribute__ ((__packed__));
>  
>  struct btrfs_qgroup_info_item {
> -- 
> 2.41.0

* Re: [PATCH v5 16/18] btrfs: track metadata relocation cow with simple quota
  2023-07-27 22:13 ` [PATCH v5 16/18] btrfs: track metadata relocation cow with simple quota Boris Burkov
@ 2023-09-07 12:27   ` David Sterba
  0 siblings, 0 replies; 53+ messages in thread
From: David Sterba @ 2023-09-07 12:27 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Thu, Jul 27, 2023 at 03:13:03PM -0700, Boris Burkov wrote:
> Relocation cows metadata blocks in two cases for the reloc root:
> - copying the subvol root item when creating the reloc root
> - copying a btree node when there is a cow during relocation
> 
> In both cases, the resulting btree node hits an abnormal code path with
> respect to the owner field in its btrfs_header. It first creates the
> root item for the new objectid, which populates the reloc root id, and
> it is at this point that delayed refs are created.
> 
> Later, it fully copies the old node into the new node (including the
> original owner field) which overwrites it. This results in a simple
> quotas mismatch where we run the delayed ref for the reloc root which
> has no simple quota effect (reloc root is not an fstree) but when we
> ultimately delete the node, the owner is the real original fstree and we
> do free the space.
> 
> To work around this without tampering with the behavior of relocation,
> add a parameter to btrfs_add_tree_block that lets the relocation code
> path specify a different owning root than the "operating" root (in this
> case, owning root is the real root and the operating root is the reloc
> root). These can naturally be plumbed into delayed refs that have the
> same concept.
> 
> Note that this is a double count in some sense, but a relatively natural
> one, as there are really two extents, and the old one will be deleted
> soon. This is consistent with how data relocation extents are accounted
> by simple quotas.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/ctree.c       | 22 ++++++++++++++--------
>  fs/btrfs/disk-io.c     |  4 ++--
>  fs/btrfs/extent-tree.c |  8 ++++++--
>  fs/btrfs/extent-tree.h |  3 ++-
>  fs/btrfs/ioctl.c       |  2 +-
>  5 files changed, 25 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index a4cb4b642987..cb0d4535de37 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -316,6 +316,7 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans,
>  	int ret = 0;
>  	int level;
>  	struct btrfs_disk_key disk_key;
> +	u64 reloc_src_root = 0;
>  
>  	WARN_ON(test_bit(BTRFS_ROOT_SHAREABLE, &root->state) &&
>  		trans->transid != fs_info->running_transaction->transid);
> @@ -328,9 +329,11 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans,
>  	else
>  		btrfs_node_key(buf, &disk_key, 0);
>  
> +	if (new_root_objectid == BTRFS_TREE_RELOC_OBJECTID)
> +		reloc_src_root = btrfs_header_owner(buf);
>  	cow = btrfs_alloc_tree_block(trans, root, 0, new_root_objectid,
>  				     &disk_key, level, buf->start, 0,
> -				     BTRFS_NESTING_NEW_ROOT);
> +				     BTRFS_NESTING_NEW_ROOT, reloc_src_root);
>  	if (IS_ERR(cow))
>  		return PTR_ERR(cow);
>  
> @@ -522,6 +525,7 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
>  	int last_ref = 0;
>  	int unlock_orig = 0;
>  	u64 parent_start = 0;
> +	u64 reloc_src_root = 0;
>  
>  	if (*cow_ret == buf)
>  		unlock_orig = 1;
> @@ -540,12 +544,14 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
>  	else
>  		btrfs_node_key(buf, &disk_key, 0);
>  
> -	if ((root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) && parent)
> -		parent_start = parent->start;
> -
> +	if (root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) {
> +		if (parent)
> +			parent_start = parent->start;
> +		reloc_src_root = btrfs_header_owner(buf);
> +	}
>  	cow = btrfs_alloc_tree_block(trans, root, parent_start,
>  				     root->root_key.objectid, &disk_key, level,
> -				     search_start, empty_size, nest);
> +				     search_start, empty_size, nest, reloc_src_root);
>  	if (IS_ERR(cow))
>  		return PTR_ERR(cow);
>  
> @@ -2956,7 +2962,7 @@ static noinline int insert_new_root(struct btrfs_trans_handle *trans,
>  
>  	c = btrfs_alloc_tree_block(trans, root, 0, root->root_key.objectid,
>  				   &lower_key, level, root->node->start, 0,
> -				   BTRFS_NESTING_NEW_ROOT);
> +				   BTRFS_NESTING_NEW_ROOT, 0);
>  	if (IS_ERR(c))
>  		return PTR_ERR(c);
>  
> @@ -3100,7 +3106,7 @@ static noinline int split_node(struct btrfs_trans_handle *trans,
>  
>  	split = btrfs_alloc_tree_block(trans, root, 0, root->root_key.objectid,
>  				       &disk_key, level, c->start, 0,
> -				       BTRFS_NESTING_SPLIT);
> +				       BTRFS_NESTING_SPLIT, 0);
>  	if (IS_ERR(split))
>  		return PTR_ERR(split);
>  
> @@ -3853,7 +3859,7 @@ static noinline int split_leaf(struct btrfs_trans_handle *trans,
>  	right = btrfs_alloc_tree_block(trans, root, 0, root->root_key.objectid,
>  				       &disk_key, 0, l->start, 0,
>  				       num_doubles ? BTRFS_NESTING_NEW_ROOT :
> -				       BTRFS_NESTING_SPLIT);
> +				       BTRFS_NESTING_SPLIT, 0);
>  	if (IS_ERR(right))
>  		return PTR_ERR(right);
>  
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index b4495d4c1533..e2b0e11800fc 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -862,7 +862,7 @@ struct btrfs_root *btrfs_create_tree(struct btrfs_trans_handle *trans,
>  	root->root_key.offset = 0;
>  
>  	leaf = btrfs_alloc_tree_block(trans, root, 0, objectid, NULL, 0, 0, 0,
> -				      BTRFS_NESTING_NORMAL);
> +				      BTRFS_NESTING_NORMAL, 0);
>  	if (IS_ERR(leaf)) {
>  		ret = PTR_ERR(leaf);
>  		leaf = NULL;
> @@ -939,7 +939,7 @@ int btrfs_alloc_log_tree_node(struct btrfs_trans_handle *trans,
>  	 */
>  
>  	leaf = btrfs_alloc_tree_block(trans, root, 0, BTRFS_TREE_LOG_OBJECTID,
> -			NULL, 0, 0, 0, BTRFS_NESTING_NORMAL);
> +			NULL, 0, 0, 0, BTRFS_NESTING_NORMAL, 0);
>  	if (IS_ERR(leaf))
>  		return PTR_ERR(leaf);
>  
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 395ab46e520b..50db75529a83 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4989,7 +4989,8 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
>  					     const struct btrfs_disk_key *key,
>  					     int level, u64 hint,
>  					     u64 empty_size,
> -					     enum btrfs_lock_nesting nest)
> +					     enum btrfs_lock_nesting nest,
> +					     u64 reloc_src_root)

Please move the new parameter before 'nest'.

>  {
>  	struct btrfs_fs_info *fs_info = root->fs_info;
>  	struct btrfs_key ins;
> @@ -5001,6 +5002,7 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
>  	int ret;
>  	u32 blocksize = fs_info->nodesize;
>  	bool skinny_metadata = btrfs_fs_incompat(fs_info, SKINNY_METADATA);
> +	u64 owning_root;
>  
>  #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>  	if (btrfs_is_testing(fs_info)) {
> @@ -5027,11 +5029,13 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
>  		ret = PTR_ERR(buf);
>  		goto out_free_reserved;
>  	}
> +	owning_root = btrfs_header_owner(buf);
>  
>  	if (root_objectid == BTRFS_TREE_RELOC_OBJECTID) {
>  		if (parent == 0)
>  			parent = ins.objectid;
>  		flags |= BTRFS_BLOCK_FLAG_FULL_BACKREF;
> +		owning_root = reloc_src_root;
>  	} else
>  		BUG_ON(parent > 0);
>  
> @@ -5051,7 +5055,7 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
>  		extent_op->level = level;
>  
>  		btrfs_init_generic_ref(&generic_ref, BTRFS_ADD_DELAYED_EXTENT,
> -				       ins.objectid, ins.offset, parent, btrfs_header_owner(buf));
> +				       ins.objectid, ins.offset, parent, owning_root);
>  		btrfs_init_tree_ref(&generic_ref, level, root_objectid,
>  				    root->root_key.objectid, false);
>  		btrfs_ref_tree_mod(fs_info, &generic_ref);
> diff --git a/fs/btrfs/extent-tree.h b/fs/btrfs/extent-tree.h
> index 7c27652880a2..99b11e278ae4 100644
> --- a/fs/btrfs/extent-tree.h
> +++ b/fs/btrfs/extent-tree.h
> @@ -118,7 +118,8 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
>  					     const struct btrfs_disk_key *key,
>  					     int level, u64 hint,
>  					     u64 empty_size,
> -					     enum btrfs_lock_nesting nest);
> +					     enum btrfs_lock_nesting nest,
> +					     u64 reloc_src_root);
>  void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
>  			   u64 root_id,
>  			   struct extent_buffer *buf,
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index c9b069077fd0..f3807def6596 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -657,7 +657,7 @@ static noinline int create_subvol(struct mnt_idmap *idmap,
>  		goto out;
>  
>  	leaf = btrfs_alloc_tree_block(trans, root, 0, objectid, NULL, 0, 0, 0,
> -				      BTRFS_NESTING_NORMAL);
> +				      BTRFS_NESTING_NORMAL, 0);
>  	if (IS_ERR(leaf)) {
>  		ret = PTR_ERR(leaf);
>  		goto out;
> -- 
> 2.41.0

* Re: [PATCH v5 00/18] btrfs: simple quotas
  2023-09-07 10:51 ` [PATCH v5 00/18] btrfs: simple quotas David Sterba
@ 2023-09-07 20:51   ` Boris Burkov
  2023-09-11 18:06     ` David Sterba
  2023-09-11 18:12   ` David Sterba
  1 sibling, 1 reply; 53+ messages in thread
From: Boris Burkov @ 2023-09-07 20:51 UTC (permalink / raw)
  To: David Sterba; +Cc: linux-btrfs, kernel-team

On Thu, Sep 07, 2023 at 12:51:15PM +0200, David Sterba wrote:
> On Thu, Jul 27, 2023 at 03:12:47PM -0700, Boris Burkov wrote:
> > btrfs quota groups (qgroups) are a compelling feature of btrfs that
> > allow flexible control for limiting subvolume data and metadata usage.
> > However, due to btrfs's high level decision to tradeoff snapshot
> > performance against ref-counting performance, qgroups suffer from
> > non-trivial performance issues that make them unattractive in certain
> > workloads. Particularly, frequent backref walking during writes and
> > during commits can make operations increasingly expensive as the number
> > of snapshots scales up. For that reason, we have never been able to
> > commit to using qgroups in production at Meta, despite significant
> > interest from people running container workloads, where we would benefit
> > from protecting the rest of the host from a buggy application in a
> > container running away with disk usage. This patch series introduces a
> > simplified version of qgroups called
> > simple quotas (squotas) which never computes global reference counts
> > for extents, and thus has similar performance characteristics to normal,
> > quotas disabled, btrfs. The "trick" is that in simple quotas mode, we
> > account all extents permanently to the subvolume in which they were
> > originally created. That allows us to make all accounting 1:1 with
> > extent item lifetime, removing the need to walk backrefs. However,
> > this sacrifices the ability to compute shared vs. exclusive usage. It
> > also results in counter-intuitive, though still predictable and simple
> > accounting in the cases where an original extent is removed while a
> > shared copy still exists. Qgroups is able to detect that case and count
> > the remaining copy as an exclusive owner, while squotas is not. As a
> > result, squotas works best when the original extent is immutable and
> > outlives any clones.
> > 
> > ==Format Change==
> > In order to track the original creating subvolume of a data extent in
> > the face of reflinks, it is necessary to add additional accounting to
> > the extent item. To save space, this is done with a new inline ref item.
> > However, the downside of this approach is that it makes enabling squota
> > an incompat change, denoted by the new incompat bit SIMPLE_QUOTA. When
> > this bit is set and quotas are enabled, new extent items get the extra
> > accounting, and freed extent items check for the accounting to find
> > their creating subvolume. In addition, 1:1 with this incompat bit,
> > the quota status item now tracks a "quota enablement generation" needed
> > > for properly handling deleting extents which predate enablement.
> > 
> > ==API==
> > Squotas reuses the api of qgroups.
> 
> So apart from the accounting, the hierarchy of qgroups can be still
> built as before, right? In the example you create a group 1/100 so I
> assume that it's still qgroups from the outside, and that the limits can
> be set.

Yes, you can create quota group hierarchies with the same nesting
behavior. I am only changing the accounting methodology (and adding auto
hierarchy).

> 
> Because if not, then squotas would make more sense as a separate
> infrastructure, under quotas. Like that quotas are the abstraction while
> qgroups or squota would be the implementation.
> 
> > The only difference is that when you
> > enable quotas via `btrfs quota enable`, you pass the `--simple` flag.
> > Squotas will always report exclusive == shared for each qgroup. Squotas
> > deal with extent_item/metadata_item sizes and thus do not do anything
> > special with compression. Squotas also introduce auto inheritance for
> > nested subvols. The API is documented more fully in the documentation
> > patches in btrfs-progs.
> 
> The lack of exclusive size sharing will be confusing I guess, so we need
> to make it clear in the documentation and in the UI that it's either
> full or simple mode.

I am happy to iterate on that. I think always reporting shared=0 would
also make sense, since the *ownership* is exclusive. I opted for making
them equal since the usage is sort of both shared (we don't know whether
it's shared or when it will be freed) and exclusive (it belongs to this
subvol by owner ref).

> 
> I've added the patchset to for-next, we may need an iteration or two to
> fix some issues I've seen so far but on the fundamental level I think
> it's ok.

* Re: [PATCH v5 04/18] btrfs: add simple_quota incompat feature to sysfs
  2023-09-07 11:28   ` David Sterba
@ 2023-09-07 20:56     ` Boris Burkov
  0 siblings, 0 replies; 53+ messages in thread
From: Boris Burkov @ 2023-09-07 20:56 UTC (permalink / raw)
  To: David Sterba; +Cc: linux-btrfs, kernel-team

On Thu, Sep 07, 2023 at 01:28:25PM +0200, David Sterba wrote:
> On Thu, Jul 27, 2023 at 03:12:51PM -0700, Boris Burkov wrote:
> > Add an entry in the features directory for the new incompat flag
> > 
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >  fs/btrfs/sysfs.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> > index e53614753391..f62bba0068ca 100644
> > --- a/fs/btrfs/sysfs.c
> > +++ b/fs/btrfs/sysfs.c
> > @@ -291,6 +291,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(metadata_uuid, METADATA_UUID);
> >  BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
> >  BTRFS_FEAT_ATTR_COMPAT_RO(block_group_tree, BLOCK_GROUP_TREE);
> >  BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
> > +BTRFS_FEAT_ATTR_INCOMPAT(simple_quota, SIMPLE_QUOTA);
> 
> I'm not sure if you mentioned in the cover letter or if we had discussed
> it before, but does this need to be a full incompat bit? I.e. no mount
> on older kernels, compared to a COMPAT_RO which would allow
> read-only mount.

Unfortunately, as it is, simple quotas does need a full incompat bit.
That is because of the details of how the kernel parses inline refs:
essentially, that code relies on the item size being fully exhausted by
an iteration that steps forward in chunks whose sizes are hard-coded per
inline ref type. That parsing code blows up on the new structures in a
way that can't be fixed, as far as I can tell.

To be COMPAT_RO, we would need to introduce an entirely new item. We
discussed this in one of the early discussions and concluded that was
not worth the space cost compared to an inline item.
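
To make the parsing problem concrete, here is a heavily simplified
userspace model, not the kernel's actual code. An "old" parser walks a
byte buffer of typed records, derives each record's size from its type,
and requires the walk to end exactly at the item size. The type values,
record sizes and function names are invented for the example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Invented record types; the first byte of each record is its type. */
enum { EX_REF_TREE = 1, EX_REF_DATA = 2, EX_REF_OWNER = 3 /* new type */ };

/* Record sizes an "old" parser knows about; 0 means unknown type. */
static size_t ex_old_ref_size(uint8_t type)
{
	switch (type) {
	case EX_REF_TREE:
		return 9;	/* type byte + 8-byte offset */
	case EX_REF_DATA:
		return 29;	/* type byte + 28-byte body */
	default:
		return 0;
	}
}

/* Return 0 if the walk consumes exactly item_size bytes, -1 otherwise. */
static int ex_old_walk(const uint8_t *item, size_t item_size)
{
	size_t pos = 0;

	while (pos < item_size) {
		size_t sz = ex_old_ref_size(item[pos]);

		/* Unknown type or truncated record: the old parser cannot step on. */
		if (sz == 0 || sz > item_size - pos)
			return -1;
		pos += sz;
	}
	return 0;
}
```

In this model the old parser has no way to skip a record whose type it
does not know, because the step size comes from the type itself. That is
the property that rules out a read-only compatible mount.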

> 
> >  #ifdef CONFIG_BLK_DEV_ZONED
> >  BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
> >  #endif
> > @@ -322,6 +323,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
> >  	BTRFS_FEAT_ATTR_PTR(free_space_tree),
> >  	BTRFS_FEAT_ATTR_PTR(raid1c34),
> >  	BTRFS_FEAT_ATTR_PTR(block_group_tree),
> > +	BTRFS_FEAT_ATTR_PTR(simple_quota),
> >  #ifdef CONFIG_BLK_DEV_ZONED
> >  	BTRFS_FEAT_ATTR_PTR(zoned),
> >  #endif
> > -- 
> > 2.41.0

* Re: [PATCH v5 15/18] btrfs: check generation when recording simple quota delta
  2023-09-07 12:24   ` David Sterba
@ 2023-09-08 21:41     ` Boris Burkov
  2023-09-11 18:00       ` David Sterba
  0 siblings, 1 reply; 53+ messages in thread
From: Boris Burkov @ 2023-09-08 21:41 UTC (permalink / raw)
  To: David Sterba; +Cc: linux-btrfs, kernel-team

On Thu, Sep 07, 2023 at 02:24:49PM +0200, David Sterba wrote:
> On Thu, Jul 27, 2023 at 03:13:02PM -0700, Boris Burkov wrote:
> > Simple quotas count extents only from the moment the feature is enabled.
> > Therefore, if we do something like:
> > 1. create subvol S
> > 2. write F in S
> > 3. enable quotas
> > 4. remove F
> > 5. write G in S
> > 
> > then after 3. and 4. we would expect the simple quota usage of S to be 0
> > (putting aside some metadata extents that might be written) and after
> > 5., it should be the size of G plus metadata. Therefore, we need to be
> > able to determine whether a particular quota delta we are processing
> > predates simple quota enablement.
> > 
> > To do this, store the transaction id when quotas were enabled. In
> > fs_info for immediate use and in the quota status item to make it
> > recoverable on mount. When we see a delta, check if the generation of
> > the extent item is less than that of quota enablement. If so, we should
> > ignore the delta from this extent.
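
The five steps above can be modelled in a few lines. This is a userspace
sketch of the generation check only; ex_fs, ex_delta and ex_record_delta
are made-up names, the generation numbers are arbitrary, and the real
function also deals with locking, reservations and the qgroup hierarchy:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct ex_fs {
	uint64_t quota_enable_gen;	/* transid at which squotas were enabled */
	uint64_t usage;			/* bytes accounted to subvol S */
};

struct ex_delta {
	uint64_t num_bytes;
	uint64_t generation;	/* generation the extent was created in */
	bool is_inc;		/* true: allocation, false: free */
};

/* Apply a delta unless the extent predates quota enablement. */
static void ex_record_delta(struct ex_fs *fs, const struct ex_delta *d)
{
	if (d->generation < fs->quota_enable_gen)
		return;
	if (d->is_inc)
		fs->usage += d->num_bytes;
	else
		fs->usage -= d->num_bytes;
}

/* Replay steps 2-5 from the commit message; returns the usage of S. */
static uint64_t ex_replay(void)
{
	struct ex_fs fs = { .quota_enable_gen = 0, .usage = 0 };

	/* 2. F is written in generation 5; quotas are off, nothing recorded. */
	/* 3. Quotas are enabled in generation 10. */
	fs.quota_enable_gen = 10;

	/* 4. Remove F: a free of an extent from generation 5, so it is ignored. */
	struct ex_delta free_f = { .num_bytes = 4096, .generation = 5, .is_inc = false };
	ex_record_delta(&fs, &free_f);

	/* 5. Write G in generation 11: counted. */
	struct ex_delta alloc_g = { .num_bytes = 8192, .generation = 11, .is_inc = true };
	ex_record_delta(&fs, &alloc_g);

	return fs.usage;
}
```

Without the generation check, step 4 would subtract bytes that were never
added and the unsigned usage counter would wrap.
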
> > 
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >  fs/btrfs/accessors.h            |  2 ++
> >  fs/btrfs/extent-tree.c          |  4 ++++
> >  fs/btrfs/fs.h                   |  2 ++
> >  fs/btrfs/qgroup.c               | 14 ++++++++++++--
> >  fs/btrfs/qgroup.h               |  1 +
> >  include/uapi/linux/btrfs_tree.h |  7 +++++++
> >  6 files changed, 28 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
> > index a23045c05937..513f8edbd98e 100644
> > --- a/fs/btrfs/accessors.h
> > +++ b/fs/btrfs/accessors.h
> > @@ -970,6 +970,8 @@ BTRFS_SETGET_FUNCS(qgroup_status_flags, struct btrfs_qgroup_status_item,
> >  		   flags, 64);
> >  BTRFS_SETGET_FUNCS(qgroup_status_rescan, struct btrfs_qgroup_status_item,
> >  		   rescan, 64);
> > +BTRFS_SETGET_FUNCS(qgroup_status_enable_gen, struct btrfs_qgroup_status_item,
> > +		   enable_gen, 64);
> >  
> >  /* btrfs_qgroup_info_item */
> >  BTRFS_SETGET_FUNCS(qgroup_info_generation, struct btrfs_qgroup_info_item,
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 1b5efd03ef83..395ab46e520b 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -1513,6 +1513,7 @@ static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
> >  			.rsv_bytes = href->reserved_bytes,
> >  			.is_data = true,
> >  			.is_inc	= true,
> > +			.generation = trans->transid,
> >  		};
> >  
> >  		if (extent_op)
> > @@ -1676,6 +1677,7 @@ static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
> >  			.rsv_bytes = 0,
> >  			.is_data = false,
> >  			.is_inc = true,
> > +			.generation = trans->transid,
> >  		};
> >  
> >  		BUG_ON(!extent_op || !extent_op->update_flags);
> > @@ -3217,6 +3219,7 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
> >  			.rsv_bytes = 0,
> >  			.is_data = is_data,
> >  			.is_inc = false,
> > +			.generation = btrfs_extent_generation(leaf, ei),
> >  		};
> >  
> >  		/* In this branch refs == 1 */
> > @@ -4850,6 +4853,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
> >  	struct btrfs_simple_quota_delta delta = {
> >  		.root = root_objectid,
> >  		.num_bytes = ins->offset,
> > +		.generation = trans->transid,
> >  		.rsv_bytes = 0,
> >  		.is_data = true,
> >  		.is_inc = true,
> > diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> > index f76f450c2abf..da7b623ff15f 100644
> > --- a/fs/btrfs/fs.h
> > +++ b/fs/btrfs/fs.h
> > @@ -802,6 +802,8 @@ struct btrfs_fs_info {
> >  	spinlock_t eb_leak_lock;
> >  	struct list_head allocated_ebs;
> >  #endif
> > +
> > +	u64 quota_enable_gen;
> 
> > Please move it to the other quota/qgroup related members; at the end of
> > fs_info there's only debugging stuff.
> 
> >  };
> >  
> >  static inline void btrfs_set_last_root_drop_gen(struct btrfs_fs_info *fs_info,
> > diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> > index 58e9ed0deedd..a8a603242431 100644
> > --- a/fs/btrfs/qgroup.c
> > +++ b/fs/btrfs/qgroup.c
> > @@ -454,6 +454,8 @@ int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info)
> >  			}
> >  			fs_info->qgroup_flags = btrfs_qgroup_status_flags(l, ptr);
> >  			simple = fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_SIMPLE;
> > +			if (simple)
> > +				fs_info->quota_enable_gen = btrfs_qgroup_status_enable_gen(l, ptr);
> >  			if (btrfs_qgroup_status_generation(l, ptr) !=
> >  			    fs_info->generation && !simple) {
> >  				qgroup_mark_inconsistent(fs_info);
> > @@ -1107,10 +1109,12 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info,
> >  	btrfs_set_qgroup_status_generation(leaf, ptr, trans->transid);
> >  	btrfs_set_qgroup_status_version(leaf, ptr, BTRFS_QGROUP_STATUS_VERSION);
> >  	fs_info->qgroup_flags = BTRFS_QGROUP_STATUS_FLAG_ON;
> > -	if (simple)
> > +	if (simple) {
> >  		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_SIMPLE;
> > -	else
> > +		btrfs_set_qgroup_status_enable_gen(leaf, ptr, trans->transid);
> > +	} else {
> >  		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
> > +	}
> >  	btrfs_set_qgroup_status_flags(leaf, ptr, fs_info->qgroup_flags &
> >  				      BTRFS_QGROUP_STATUS_FLAGS_MASK);
> >  	btrfs_set_qgroup_status_rescan(leaf, ptr, 0);
> > @@ -1202,6 +1206,8 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info,
> >  		goto out_free_path;
> >  	}
> >  
> > +	fs_info->quota_enable_gen = trans->transid;
> > +
> >  	mutex_unlock(&fs_info->qgroup_ioctl_lock);
> >  	/*
> >  	 * Commit the transaction while not holding qgroup_ioctl_lock, to avoid
> > @@ -4622,6 +4628,10 @@ int btrfs_record_simple_quota_delta(struct btrfs_fs_info *fs_info,
> >  	if (!is_fstree(root))
> >  		return 0;
> >  
> > +	/* If the extent predates enabling quotas, don't count it. */
> > +	if (delta->generation < fs_info->quota_enable_gen)
> > +		return 0;
> > +
> >  	spin_lock(&fs_info->qgroup_lock);
> >  	qgroup = find_qgroup_rb(fs_info, root);
> >  	if (!qgroup) {
> > diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
> > index ce6fa8694ca7..ae1ce14b365c 100644
> > --- a/fs/btrfs/qgroup.h
> > +++ b/fs/btrfs/qgroup.h
> > @@ -241,6 +241,7 @@ struct btrfs_simple_quota_delta {
> >  	u64 rsv_bytes; /* The number of bytes reserved for this extent */
> >  	bool is_inc; /* Whether we are using or freeing the extent */
> >  	bool is_data; /* Whether the extent is data or metadata */
> > +	u64 generation; /* The generation the extent was created in */
> 
> Please reorder it so it does not leave gaps between struct members.
> 
> >  };
> >  
> >  static inline u64 btrfs_qgroup_subvolid(u64 qgroupid)
> > diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> > index eacb26caf3c6..1120ce3dae42 100644
> > --- a/include/uapi/linux/btrfs_tree.h
> > +++ b/include/uapi/linux/btrfs_tree.h
> > @@ -1242,6 +1242,13 @@ struct btrfs_qgroup_status_item {
> >  	 * of the scan. It contains a logical address
> >  	 */
> >  	__le64 rescan;
> > +
> > +	/*
> > +	 * the generation when quotas are enabled. Used by simple quotas to
> > +	 * avoid decrementing when freeing an extent that was written before
> > +	 * enable.
> > +	 */
> > +	__le64 enable_gen;
> 
> > This is a public interface, and btrfs_qgroup_status_item is used in many
> > places in user space, at least in btrfs-progs. This needs a lot of
> > sanity checks.

Totally agreed in principle, but not exactly sure how to proceed in
practice. I would definitely appreciate some tips/help!

How we interact with the new field:
- When enabling squota, set it, the incompat bit, and the status flag
- When reading in the qgroup status_item, if the status flag is set,
  then read the enable_gen.

I believe this prevents us from ever reading garbage while trying to
read an old fs (status flag won't be set) and it prevents any
btrfs-progs from getting confused by a wrong-sized status item, since
it would choke on the incompat bit first.

Am I missing some other case? I can try to make it more explicitly
zeroed when we enable qgroups but not squotas? I can add an ASSERT that
the incompat bit is set as expected when we read the status item with
the flag on (that seems good no matter what)?

I can also write a wrapper for getting it which does the incompat/status
flag checking to make it more clear that it isn't safe to read in
general. Or a comment on the struct saying it depends on the incompat
bit?
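
As a rough sketch of the wrapper idea, in userspace C with stand-in
types: ex_fs_info, both flag constants and ex_quota_enable_gen() are
hypothetical and not existing kernel helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EX_INCOMPAT_SIMPLE_QUOTA	(1ULL << 0)	/* stand-in for the incompat bit */
#define EX_STATUS_FLAG_SIMPLE		(1ULL << 0)	/* stand-in for the status flag */

struct ex_fs_info {
	uint64_t incompat_flags;
	uint64_t qgroup_flags;
	uint64_t quota_enable_gen;
};

/* True only when both the incompat bit and the status flag are set. */
static bool ex_squota_enabled(const struct ex_fs_info *fs)
{
	return (fs->incompat_flags & EX_INCOMPAT_SIMPLE_QUOTA) &&
	       (fs->qgroup_flags & EX_STATUS_FLAG_SIMPLE);
}

/* Return the enable generation, or 0 when the field must not be trusted. */
static uint64_t ex_quota_enable_gen(const struct ex_fs_info *fs)
{
	return ex_squota_enabled(fs) ? fs->quota_enable_gen : 0;
}
```

Returning 0 in the untrusted case means a "generation < enable_gen"
comparison never skips a delta, which seems like the safer default, but
that choice is an assumption of the sketch.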

Thanks for all the review, by the way.

> 
> >  } __attribute__ ((__packed__));
> >  
> >  struct btrfs_qgroup_info_item {
> > -- 
> > 2.41.0

* Re: [PATCH v5 06/18] btrfs: create qgroup earlier in snapshot creation
  2023-09-07 11:41   ` David Sterba
@ 2023-09-08 22:50     ` Boris Burkov
  0 siblings, 0 replies; 53+ messages in thread
From: Boris Burkov @ 2023-09-08 22:50 UTC (permalink / raw)
  To: David Sterba; +Cc: linux-btrfs, kernel-team

On Thu, Sep 07, 2023 at 01:41:35PM +0200, David Sterba wrote:
> On Thu, Jul 27, 2023 at 03:12:53PM -0700, Boris Burkov wrote:
> > Pull creating the qgroup earlier in the snapshot. This allows simple
> > quotas qgroups to see all the metadata writes related to the snapshot
> > being created and to be born with the root node accounted.
> > 
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >  fs/btrfs/qgroup.c      | 3 +++
> >  fs/btrfs/transaction.c | 6 ++++++
> >  2 files changed, 9 insertions(+)
> > 
> > diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> > index 18f521716e8d..8e3a4ced3077 100644
> > --- a/fs/btrfs/qgroup.c
> > +++ b/fs/btrfs/qgroup.c
> > @@ -1672,6 +1672,9 @@ int btrfs_create_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid)
> >  	struct btrfs_qgroup *qgroup;
> >  	int ret = 0;
> >  
> > +	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_DISABLED)
> > +		return 0;
> > +
> >  	mutex_lock(&fs_info->qgroup_ioctl_lock);
> >  	if (!fs_info->quota_root) {
> >  		ret = -ENOTCONN;
> > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> > index 89ff15aa085f..25217888e897 100644
> > --- a/fs/btrfs/transaction.c
> > +++ b/fs/btrfs/transaction.c
> > @@ -1722,6 +1722,12 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
> >  	}
> >  	btrfs_release_path(path);
> >  
> > +	ret = btrfs_create_qgroup(trans, objectid);
> > +	if (ret) {
> > +		btrfs_abort_transaction(trans, ret);
> 
> > This adds an error case to the middle of a transaction commit.
> Snapshots are created in two parts, first is the ioctl adding the
> structure and then commit actually creates that. So the first phase
> preallocates what's needed (the root_item and path) and should do the
> same with the qgroups as much as possible.
> 
> Also check all the things that btrfs_create_qgroup() does, searches the
> qgroup tree, adds the new item, takes the qgroup_ioctl_lock mutex, and
> adds the sysfs entry (that does allocations under GFP_KERNEL).

I believe it does it with GFP_NOFS via allocating "prealloc". I might be
missing another allocation under the covers. That's covered below,
though.

> If you really need to create the qgroup like that then it needs much
> more care.

As I understand it, the way that the qgroup gets created currently is by
qgroup_account_snapshot which calls btrfs_qgroup_inherit in this same
function.

btrfs_create_qgroup consists of:
- lock qgroup_ioctl_lock
- do an rbtree lookup for the qgid
- do a NOFS "prealloc" allocation for the qgroup struct
- add the qgroup item
- add it to the rbtree
- add it to sysfs (using the above nofs prealloc)

With the exception of the qgroup_ioctl_lock, all those are in
btrfs_qgroup_inherit (and much more).

So that is all happening within create_pending_snapshot and thus the
commit critical section. It also does other work like backref walks,
and committing the roots.

Am I missing something important about the relative parts of
create_pending_snapshots where this work is happening? My intent was to
pull it up to before the run_delayed_refs in create_pending_snapshots
so that the new dir metadata item gets counted correctly. I think I may
have gotten delayed_refs and delayed_items confused and pulled it up
*too* far, and can probably stuff it earlier in that account function
or something.

Apologies if I am fundamentally misunderstanding something here.

* Re: [PATCH v5 15/18] btrfs: check generation when recording simple quota delta
  2023-09-08 21:41     ` Boris Burkov
@ 2023-09-11 18:00       ` David Sterba
  2023-09-13  0:17         ` Boris Burkov
  0 siblings, 1 reply; 53+ messages in thread
From: David Sterba @ 2023-09-11 18:00 UTC (permalink / raw)
  To: Boris Burkov; +Cc: David Sterba, linux-btrfs, kernel-team

On Fri, Sep 08, 2023 at 02:41:46PM -0700, Boris Burkov wrote:
> On Thu, Sep 07, 2023 at 02:24:49PM +0200, David Sterba wrote:
> > On Thu, Jul 27, 2023 at 03:13:02PM -0700, Boris Burkov wrote:
> > > Simple quotas count extents only from the moment the feature is enabled.
> > > Therefore, if we do something like:
> > > 1. create subvol S
> > > 2. write F in S
> > > 3. enable quotas
> > > 4. remove F
> > > 5. write G in S
> > > 
> > > then after 3. and 4. we would expect the simple quota usage of S to be 0
> > > (putting aside some metadata extents that might be written) and after
> > > 5., it should be the size of G plus metadata. Therefore, we need to be
> > > able to determine whether a particular quota delta we are processing
> > > predates simple quota enablement.
> > > 
> > > To do this, store the transaction id when quotas were enabled. In
> > > fs_info for immediate use and in the quota status item to make it
> > > recoverable on mount. When we see a delta, check if the generation of
> > > the extent item is less than that of quota enablement. If so, we should
> > > ignore the delta from this extent.
> > > 
> > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > ---
> > >  fs/btrfs/accessors.h            |  2 ++
> > >  fs/btrfs/extent-tree.c          |  4 ++++
> > >  fs/btrfs/fs.h                   |  2 ++
> > >  fs/btrfs/qgroup.c               | 14 ++++++++++++--
> > >  fs/btrfs/qgroup.h               |  1 +
> > >  include/uapi/linux/btrfs_tree.h |  7 +++++++
> > >  6 files changed, 28 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
> > > index a23045c05937..513f8edbd98e 100644
> > > --- a/fs/btrfs/accessors.h
> > > +++ b/fs/btrfs/accessors.h
> > > @@ -970,6 +970,8 @@ BTRFS_SETGET_FUNCS(qgroup_status_flags, struct btrfs_qgroup_status_item,
> > >  		   flags, 64);
> > >  BTRFS_SETGET_FUNCS(qgroup_status_rescan, struct btrfs_qgroup_status_item,
> > >  		   rescan, 64);
> > > +BTRFS_SETGET_FUNCS(qgroup_status_enable_gen, struct btrfs_qgroup_status_item,
> > > +		   enable_gen, 64);
> > >  
> > >  /* btrfs_qgroup_info_item */
> > >  BTRFS_SETGET_FUNCS(qgroup_info_generation, struct btrfs_qgroup_info_item,
> > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > > index 1b5efd03ef83..395ab46e520b 100644
> > > --- a/fs/btrfs/extent-tree.c
> > > +++ b/fs/btrfs/extent-tree.c
> > > @@ -1513,6 +1513,7 @@ static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
> > >  			.rsv_bytes = href->reserved_bytes,
> > >  			.is_data = true,
> > >  			.is_inc	= true,
> > > +			.generation = trans->transid,
> > >  		};
> > >  
> > >  		if (extent_op)
> > > @@ -1676,6 +1677,7 @@ static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
> > >  			.rsv_bytes = 0,
> > >  			.is_data = false,
> > >  			.is_inc = true,
> > > +			.generation = trans->transid,
> > >  		};
> > >  
> > >  		BUG_ON(!extent_op || !extent_op->update_flags);
> > > @@ -3217,6 +3219,7 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
> > >  			.rsv_bytes = 0,
> > >  			.is_data = is_data,
> > >  			.is_inc = false,
> > > +			.generation = btrfs_extent_generation(leaf, ei),
> > >  		};
> > >  
> > >  		/* In this branch refs == 1 */
> > > @@ -4850,6 +4853,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
> > >  	struct btrfs_simple_quota_delta delta = {
> > >  		.root = root_objectid,
> > >  		.num_bytes = ins->offset,
> > > +		.generation = trans->transid,
> > >  		.rsv_bytes = 0,
> > >  		.is_data = true,
> > >  		.is_inc = true,
> > > diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> > > index f76f450c2abf..da7b623ff15f 100644
> > > --- a/fs/btrfs/fs.h
> > > +++ b/fs/btrfs/fs.h
> > > @@ -802,6 +802,8 @@ struct btrfs_fs_info {
> > >  	spinlock_t eb_leak_lock;
> > >  	struct list_head allocated_ebs;
> > >  #endif
> > > +
> > > +	u64 quota_enable_gen;
> > 
> > Please move it to the other quota/qgroup related members, at the end of
> > fs_info there's only debugging stuff.
> > 
> > >  };
> > >  
> > >  static inline void btrfs_set_last_root_drop_gen(struct btrfs_fs_info *fs_info,
> > > diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> > > index 58e9ed0deedd..a8a603242431 100644
> > > --- a/fs/btrfs/qgroup.c
> > > +++ b/fs/btrfs/qgroup.c
> > > @@ -454,6 +454,8 @@ int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info)
> > >  			}
> > >  			fs_info->qgroup_flags = btrfs_qgroup_status_flags(l, ptr);
> > >  			simple = fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_SIMPLE;
> > > +			if (simple)
> > > +				fs_info->quota_enable_gen = btrfs_qgroup_status_enable_gen(l, ptr);
> > >  			if (btrfs_qgroup_status_generation(l, ptr) !=
> > >  			    fs_info->generation && !simple) {
> > >  				qgroup_mark_inconsistent(fs_info);
> > > @@ -1107,10 +1109,12 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info,
> > >  	btrfs_set_qgroup_status_generation(leaf, ptr, trans->transid);
> > >  	btrfs_set_qgroup_status_version(leaf, ptr, BTRFS_QGROUP_STATUS_VERSION);
> > >  	fs_info->qgroup_flags = BTRFS_QGROUP_STATUS_FLAG_ON;
> > > -	if (simple)
> > > +	if (simple) {
> > >  		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_SIMPLE;
> > > -	else
> > > +		btrfs_set_qgroup_status_enable_gen(leaf, ptr, trans->transid);
> > > +	} else {
> > >  		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
> > > +	}
> > >  	btrfs_set_qgroup_status_flags(leaf, ptr, fs_info->qgroup_flags &
> > >  				      BTRFS_QGROUP_STATUS_FLAGS_MASK);
> > >  	btrfs_set_qgroup_status_rescan(leaf, ptr, 0);
> > > @@ -1202,6 +1206,8 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info,
> > >  		goto out_free_path;
> > >  	}
> > >  
> > > +	fs_info->quota_enable_gen = trans->transid;
> > > +
> > >  	mutex_unlock(&fs_info->qgroup_ioctl_lock);
> > >  	/*
> > >  	 * Commit the transaction while not holding qgroup_ioctl_lock, to avoid
> > > @@ -4622,6 +4628,10 @@ int btrfs_record_simple_quota_delta(struct btrfs_fs_info *fs_info,
> > >  	if (!is_fstree(root))
> > >  		return 0;
> > >  
> > > +	/* If the extent predates enabling quotas, don't count it. */
> > > +	if (delta->generation < fs_info->quota_enable_gen)
> > > +		return 0;
> > > +
> > >  	spin_lock(&fs_info->qgroup_lock);
> > >  	qgroup = find_qgroup_rb(fs_info, root);
> > >  	if (!qgroup) {
> > > diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
> > > index ce6fa8694ca7..ae1ce14b365c 100644
> > > --- a/fs/btrfs/qgroup.h
> > > +++ b/fs/btrfs/qgroup.h
> > > @@ -241,6 +241,7 @@ struct btrfs_simple_quota_delta {
> > >  	u64 rsv_bytes; /* The number of bytes reserved for this extent */
> > >  	bool is_inc; /* Whether we are using or freeing the extent */
> > >  	bool is_data; /* Whether the extent is data or metadata */
> > > +	u64 generation; /* The generation the extent was created in */
> > 
> > Please reorder it so it does not leave gaps between struct members.
> > 
> > >  };
> > >  
> > >  static inline u64 btrfs_qgroup_subvolid(u64 qgroupid)
> > > diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> > > index eacb26caf3c6..1120ce3dae42 100644
> > > --- a/include/uapi/linux/btrfs_tree.h
> > > +++ b/include/uapi/linux/btrfs_tree.h
> > > @@ -1242,6 +1242,13 @@ struct btrfs_qgroup_status_item {
> > >  	 * of the scan. It contains a logical address
> > >  	 */
> > >  	__le64 rescan;
> > > +
> > > +	/*
> > > +	 * the generation when quotas are enabled. Used by simple quotas to
> > > +	 * avoid decrementing when freeing an extent that was written before
> > > +	 * enable.
> > > +	 */
> > > +	__le64 enable_gen;
> > 
> > This is public interface and btrfs_qgroup_status_item is used in many
> > places in user space at least in btrfs-progs. This needs a lot of
> > sanity checks.
> 
> Totally agreed in principle, but not exactly sure how to proceed in
> practice. I would definitely appreciate some tips/help!
> 
> How we interact with the new field:
> - When enabling squota, set it, the incompat bit, and the status flag
> - When reading in the qgroup status_item, if the status flag is set,
>   then read the enable_gen.
> 
> I believe this prevents us from ever reading garbage while trying to
> read an old fs (status flag won't be set) and it prevents any
> btrfs-progs from getting confused by a wrong-sized status item, since
> it would choke on the incompat bit first.
> 
> Am I missing some other case? I can try to make it more explicitly
> zeroed when we enable qgroups but not squotas? I can add an ASSERT that
> the incompat bit is set as expected when we read the status item with
> the flag on (that seems good no matter what)?
> 
> I can also write a wrapper for getting it which does the incompat/status
> flag checking to make it more clear that it isn't safe to read in
> general. Or a comment on the struct saying it depends on the incompat
> bit?

All of the above makes sense and I had something like that in mind when
writing the comment. The wrappers can make sure the bit is set when
reading the item. I think there's an example in existing code that
versions an item based on size, I can't find it now (probably something
from the send/receive time where several new struct members were added).

I just noticed we have versioning for the qgroup status item,
BTRFS_QGROUP_STATUS_VERSION is now 1 and has backward compatibility
handling. We can probably use version 2 for squotas, in addition to the
helpers with sanity checks.
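
The guarded read and the generation check discussed above can be sketched
in plain userspace C. This is only an illustration under stated
assumptions, not the kernel implementation: the struct names, flag values
and helper names below (status_item, fs_state, read_enable_gen,
record_delta, replay_scenario) are hypothetical stand-ins, and the
transaction ids and sizes are made up. It replays the "write F, enable
quotas, remove F, write G" scenario from the commit message while ignoring
metadata.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the on-disk bits; values are illustrative. */
#define STATUS_FLAG_ON		(1ULL << 0)
#define STATUS_FLAG_SIMPLE	(1ULL << 1)
#define INCOMPAT_SIMPLE_QUOTA	(1ULL << 0)

struct status_item {
	uint64_t flags;
	uint64_t enable_gen;	/* only meaningful in simple quota mode */
};

struct fs_state {
	uint64_t incompat_flags;
	uint64_t quota_enable_gen;
};

struct quota_delta {
	uint64_t num_bytes;
	uint64_t generation;	/* generation the extent was created in */
	bool is_inc;
};

/*
 * Read enable_gen only when both the status flag and the incompat bit say
 * the field exists. Returns false if the field must not be trusted.
 */
bool read_enable_gen(const struct fs_state *fs,
		     const struct status_item *item, uint64_t *gen)
{
	if (!(item->flags & STATUS_FLAG_SIMPLE))
		return false;
	if (!(fs->incompat_flags & INCOMPAT_SIMPLE_QUOTA))
		return false;
	*gen = item->enable_gen;
	return true;
}

/* Apply one delta to a usage counter, skipping extents that predate enable. */
void record_delta(const struct fs_state *fs, int64_t *usage,
		  const struct quota_delta *d)
{
	if (d->generation < fs->quota_enable_gen)
		return;
	/* A counted free must never take the usage below zero. */
	assert(d->is_inc || *usage >= (int64_t)d->num_bytes);
	if (d->is_inc)
		*usage += (int64_t)d->num_bytes;
	else
		*usage -= (int64_t)d->num_bytes;
}

/* Replays the scenario; returns the data usage of S, or -1 on a bad item. */
int64_t replay_scenario(void)
{
	struct fs_state fs = { .incompat_flags = INCOMPAT_SIMPLE_QUOTA };
	struct status_item item = {
		.flags = STATUS_FLAG_ON | STATUS_FLAG_SIMPLE,
		.enable_gen = 10,	/* quotas enabled in transaction 10 */
	};
	/* F was written in transaction 8, G in transaction 12. */
	struct quota_delta free_f = { 4096, 8, false };
	struct quota_delta alloc_g = { 8192, 12, true };
	int64_t usage = 0;
	uint64_t gen;

	if (!read_enable_gen(&fs, &item, &gen))
		return -1;
	fs.quota_enable_gen = gen;

	record_delta(&fs, &usage, &free_f);	/* ignored: predates enable */
	record_delta(&fs, &usage, &alloc_g);	/* counted */
	return usage;
}
```

In this sketch replay_scenario() ends with only the size of G accounted to
S, because the free of F carries a generation older than the enable
generation and is dropped before it can decrement anything.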


* Re: [PATCH v5 00/18] btrfs: simple quotas
  2023-09-07 20:51   ` Boris Burkov
@ 2023-09-11 18:06     ` David Sterba
  0 siblings, 0 replies; 53+ messages in thread
From: David Sterba @ 2023-09-11 18:06 UTC (permalink / raw)
  To: Boris Burkov; +Cc: David Sterba, linux-btrfs, kernel-team

On Thu, Sep 07, 2023 at 01:51:31PM -0700, Boris Burkov wrote:
> On Thu, Sep 07, 2023 at 12:51:15PM +0200, David Sterba wrote:
> > On Thu, Jul 27, 2023 at 03:12:47PM -0700, Boris Burkov wrote:
> > > btrfs quota groups (qgroups) are a compelling feature of btrfs that
> > > allow flexible control for limiting subvolume data and metadata usage.
> > > However, due to btrfs's high level decision to tradeoff snapshot
> > > performance against ref-counting performance, qgroups suffer from
> > > non-trivial performance issues that make them unattractive in certain
> > > workloads. Particularly, frequent backref walking during writes and
> > > during commits can make operations increasingly expensive as the number
> > > of snapshots scales up. For that reason, we have never been able to
> > > commit to using qgroups in production at Meta, despite significant
> > > interest from people running container workloads, where we would benefit
> > > from protecting the rest of the host from a buggy application in a
> > > container running away with disk usage. This patch series introduces a
> > > simplified version of qgroups called
> > > simple quotas (squotas) which never computes global reference counts
> > > for extents, and thus has similar performance characteristics to normal,
> > > quotas disabled, btrfs. The "trick" is that in simple quotas mode, we
> > > account all extents permanently to the subvolume in which they were
> > > originally created. That allows us to make all accounting 1:1 with
> > > extent item lifetime, removing the need to walk backrefs. However,
> > > this sacrifices the ability to compute shared vs. exclusive usage. It
> > > also results in counter-intuitive, though still predictable and simple
> > > accounting in the cases where an original extent is removed while a
> > > shared copy still exists. Qgroups is able to detect that case and count
> > > the remaining copy as an exclusive owner, while squotas is not. As a
> > > result, squotas works best when the original extent is immutable and
> > > outlives any clones.
> > > 
> > > ==Format Change==
> > > In order to track the original creating subvolume of a data extent in
> > > the face of reflinks, it is necessary to add additional accounting to
> > > the extent item. To save space, this is done with a new inline ref item.
> > > However, the downside of this approach is that it makes enabling squota
> > > an incompat change, denoted by the new incompat bit SIMPLE_QUOTA. When
> > > this bit is set and quotas are enabled, new extent items get the extra
> > > accounting, and freed extent items check for the accounting to find
> > > their creating subvolume. In addition, 1:1 with this incompat bit,
> > > the quota status item now tracks a "quota enablement generation" needed
> > > for properly handling deletion of extents which predate enablement.
> > > 
> > > ==API==
> > > Squotas reuses the api of qgroups.
> > 
> > So apart from the accounting, the hierarchy of qgroups can be still
> > built as before, right? In the example you create a group 1/100 so I
> > assume that it's still qgroups from the outside, and that the limits can
> > be set.
> 
> Yes, you can create quota group hierarchies with the same nesting
> behavior. I am only changing the accounting methodology (and added auto
> hierarchy)

OK, makes sense. The hierarchy does not need to be used and is probably
less practical for the simple accounting. What I had in mind was some
kind of flat hierarchy, now that the simple accounting is there. People
were asking about that in the past, with the drawback of lacking
shared/exclusive accounting. Adding separate subcommands and tooling
around flat quotas could be done, but the same works with squotas as
well: just "don't use the hierarchy".

> > Because if not, then squotas would make more sense as a separate
> > infrastructure, under quotas. Like that quotas are the abstraction while
> > qgroups or squota would be the implementation.
> > 
> > > The only difference is that when you
> > > enable quotas via `btrfs quota enable`, you pass the `--simple` flag.
> > > Squotas will always report exclusive == shared for each qgroup. Squotas
> > > deal with extent_item/metadata_item sizes and thus do not do anything
> > > special with compression. Squotas also introduce auto inheritance for
> > > nested subvols. The API is documented more fully in the documentation
> > > patches in btrfs-progs.
> > 
> > The lack of exclusive size sharing will be confusing I guess, so we need
> > to make it clear in the documentation and in the UI that it's either
> > full or simple mode.
> 
> I am happy to iterate on that. I also considered always reporting
> shared=0, since the *ownership* is exclusive. I opted for making them
> equal since it is sort of both shared usage (we don't know if it's shared
> nor when it will be freed) and exclusive usage (belongs to this subvol by
> owner ref).

I agree with that reasoning.
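
The accounting rule from the cover letter (an extent is charged to the
subvolume that created it for the whole lifetime of the extent item, and
the two reported numbers always move together) can be modelled in a few
lines of userspace C. This is a toy model under stated assumptions, not
kernel code: struct usage, struct extent and the helper names are
hypothetical, and a "subvolume" is just an array index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_SUBVOLS 2

/* Per-subvolume usage; in this model both numbers always stay equal. */
struct usage {
	uint64_t rfer;	/* "referenced"/shared number */
	uint64_t excl;	/* "exclusive" number */
};

/* An extent remembers its creating subvolume and its reference count. */
struct extent {
	uint64_t num_bytes;
	unsigned int owner;
	unsigned int refs;
};

struct result {
	struct usage owner;
	struct usage cloner;
};

/* Creating an extent charges the creating subvolume once. */
struct extent extent_create(struct usage *subvols, unsigned int owner,
			    uint64_t num_bytes)
{
	struct extent e = { .num_bytes = num_bytes, .owner = owner, .refs = 1 };

	assert(owner < NR_SUBVOLS);
	subvols[owner].rfer += num_bytes;
	subvols[owner].excl += num_bytes;
	return e;
}

/* A reflink or snapshot adds a reference but changes no usage. */
void extent_add_ref(struct extent *e)
{
	e->refs++;
}

/* Usage changes only when the last reference (the extent item) goes away. */
void extent_drop_ref(struct usage *subvols, struct extent *e)
{
	assert(e->refs > 0);
	if (--e->refs > 0)
		return;
	subvols[e->owner].rfer -= e->num_bytes;
	subvols[e->owner].excl -= e->num_bytes;
}

/* Subvol 0 writes an extent, subvol 1 reflinks it, subvol 0 deletes it. */
struct result run_reflink_scenario(bool drop_clone)
{
	struct usage subvols[NR_SUBVOLS] = { { 0, 0 }, { 0, 0 } };
	struct extent e = extent_create(subvols, 0, 4096);
	struct result r;

	extent_add_ref(&e);			/* subvol 1 reflinks */
	extent_drop_ref(subvols, &e);		/* subvol 0 deletes its file */
	if (drop_clone)
		extent_drop_ref(subvols, &e);	/* subvol 1 deletes the clone */
	r.owner = subvols[0];
	r.cloner = subvols[1];
	return r;
}
```

With drop_clone set to false the model shows the counter-intuitive case
mentioned in the cover letter: the creating subvolume is still charged for
the extent although only the clone references it, and the cloning
subvolume is charged nothing.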


* Re: [PATCH v5 00/18] btrfs: simple quotas
  2023-09-07 10:51 ` [PATCH v5 00/18] btrfs: simple quotas David Sterba
  2023-09-07 20:51   ` Boris Burkov
@ 2023-09-11 18:12   ` David Sterba
  1 sibling, 0 replies; 53+ messages in thread
From: David Sterba @ 2023-09-11 18:12 UTC (permalink / raw)
  To: David Sterba; +Cc: Boris Burkov, linux-btrfs, kernel-team

On Thu, Sep 07, 2023 at 12:51:15PM +0200, David Sterba wrote:
> On Thu, Jul 27, 2023 at 03:12:47PM -0700, Boris Burkov wrote:
> I've added the patchset to for-next,

There's a merge conflict due to Filipe's delayed refs changes,
"btrfs: record simple quota deltas" new parameter to
__btrfs_free_extent, run_delayed_data_ref and maybe others. I may
resolve that for for-next but this could duplicate work if you do that
too, so I can wait for a resend with other things updated.


* Re: [PATCH v5 15/18] btrfs: check generation when recording simple quota delta
  2023-09-11 18:00       ` David Sterba
@ 2023-09-13  0:17         ` Boris Burkov
  0 siblings, 0 replies; 53+ messages in thread
From: Boris Burkov @ 2023-09-13  0:17 UTC (permalink / raw)
  To: David Sterba; +Cc: linux-btrfs, kernel-team

On Mon, Sep 11, 2023 at 08:00:20PM +0200, David Sterba wrote:
> All of the above makes sense and I had something like that in mind when
> writing the comment. The wrappers can make sure the bit is set when
> reading the item. I think there's an example in existing code that
> versions an item based on size, I can't find it now (probably something
> from the send/receive time where several new struct members were added).
> 
> I just noticed we have versioning for the qgroup status item,
> BTRFS_QGROUP_STATUS_VERSION is now 1 and has backward compatibility
> handling. We can probably use version 2 for squotas, in addition to the
> helpers with sanity checks.

I made the helper/validation changes in V6, but forgot to address the
BTRFS_QGROUP_STATUS_VERSION idea. Right now, that field prevents forward
compatibility as well as backward compatibility (for lack of a better
term?). The check on it is a !=, so if you bump the version, you can no
longer honor an old fs's qgroups, which is not the case with this
change. Since the version has never been bumped, I believe we can safely
change it to a backwards compatibility check and use it for squotas.
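
The difference between the existing "!=" style check and a backwards
compatibility check can be sketched as follows. This is a userspace
illustration only: the constant for version 2 and all function names are
hypothetical, and the only value taken from the thread is that
BTRFS_QGROUP_STATUS_VERSION is currently 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define QGROUP_STATUS_VERSION_LEGACY	1	/* current on-disk version */
#define QGROUP_STATUS_VERSION_SQUOTA	2	/* hypothetical bumped value */

static_assert(QGROUP_STATUS_VERSION_LEGACY < QGROUP_STATUS_VERSION_SQUOTA,
	      "the squota version must be newer than the legacy one");

/* Strict check: any on-disk version other than the current one is refused. */
bool version_ok_strict(uint64_t on_disk, uint64_t current)
{
	return on_disk == current;
}

/* Relaxed check: accept every version up to the newest one we understand. */
bool version_ok_backward_compat(uint64_t on_disk, uint64_t newest)
{
	return on_disk >= QGROUP_STATUS_VERSION_LEGACY && on_disk <= newest;
}

/* In this sketch, enable_gen exists on disk only from the squota version on. */
bool status_item_has_enable_gen(uint64_t on_disk)
{
	return on_disk >= QGROUP_STATUS_VERSION_SQUOTA;
}
```

Under the strict check a kernel that knows version 2 would refuse a
version 1 status item, which is the problem described above. The relaxed
check still refuses versions newer than the kernel understands, so forward
incompatibility is kept.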


end of thread, other threads:[~2023-09-13  0:16 UTC | newest]

Thread overview: 53+ messages
2023-07-27 22:12 [PATCH v5 00/18] btrfs: simple quotas Boris Burkov
2023-07-27 22:12 ` [PATCH v5 01/18] btrfs: introduce quota mode Boris Burkov
2023-07-27 22:12 ` [PATCH v5 02/18] btrfs: add new quota mode for simple quotas Boris Burkov
2023-08-21 18:00   ` Josef Bacik
2023-09-07 11:19   ` David Sterba
2023-07-27 22:12 ` [PATCH v5 03/18] btrfs: expose quota mode via sysfs Boris Burkov
2023-08-21 18:00   ` Josef Bacik
2023-09-07 11:25   ` David Sterba
2023-07-27 22:12 ` [PATCH v5 04/18] btrfs: add simple_quota incompat feature to sysfs Boris Burkov
2023-08-21 18:01   ` Josef Bacik
2023-09-07 11:28   ` David Sterba
2023-09-07 20:56     ` Boris Burkov
2023-07-27 22:12 ` [PATCH v5 05/18] btrfs: flush reservations during quota disable Boris Burkov
2023-07-27 22:12 ` [PATCH v5 06/18] btrfs: create qgroup earlier in snapshot creation Boris Burkov
2023-08-21 18:02   ` Josef Bacik
2023-09-07 11:41   ` David Sterba
2023-09-08 22:50     ` Boris Burkov
2023-07-27 22:12 ` [PATCH v5 07/18] btrfs: function for recording simple quota deltas Boris Burkov
2023-08-21 18:04   ` Josef Bacik
2023-09-07 11:46   ` David Sterba
2023-07-27 22:12 ` [PATCH v5 08/18] btrfs: rename tree_ref and data_ref owning_root Boris Burkov
2023-07-27 22:12 ` [PATCH v5 09/18] btrfs: track owning root in btrfs_ref Boris Burkov
2023-08-21 18:05   ` Josef Bacik
2023-07-27 22:12 ` [PATCH v5 10/18] btrfs: track original extent owner in head_ref Boris Burkov
2023-08-21 18:06   ` Josef Bacik
2023-09-07 11:54   ` David Sterba
2023-07-27 22:12 ` [PATCH v5 11/18] btrfs: new inline ref storing owning subvol of data extents Boris Burkov
2023-08-21 18:07   ` Josef Bacik
2023-09-07 12:06   ` David Sterba
2023-07-27 22:12 ` [PATCH v5 12/18] btrfs: inline owner ref lookup helper Boris Burkov
2023-09-07 12:10   ` David Sterba
2023-07-27 22:13 ` [PATCH v5 13/18] btrfs: record simple quota deltas Boris Burkov
2023-08-21 18:08   ` Josef Bacik
2023-09-07 12:12   ` David Sterba
2023-07-27 22:13 ` [PATCH v5 14/18] btrfs: simple quota auto hierarchy for nested subvols Boris Burkov
2023-08-21 18:10   ` Josef Bacik
2023-09-07 12:16   ` David Sterba
2023-07-27 22:13 ` [PATCH v5 15/18] btrfs: check generation when recording simple quota delta Boris Burkov
2023-08-21 18:11   ` Josef Bacik
2023-09-07 12:24   ` David Sterba
2023-09-08 21:41     ` Boris Burkov
2023-09-11 18:00       ` David Sterba
2023-09-13  0:17         ` Boris Burkov
2023-07-27 22:13 ` [PATCH v5 16/18] btrfs: track metadata relocation cow with simple quota Boris Burkov
2023-09-07 12:27   ` David Sterba
2023-07-27 22:13 ` [PATCH v5 17/18] btrfs: track data relocation " Boris Burkov
2023-08-21 18:16   ` Josef Bacik
2023-07-27 22:13 ` [PATCH v5 18/18] btrfs: only set QUOTA_ENABLED when done reading qgroups Boris Burkov
2023-08-21 18:16   ` Josef Bacik
2023-09-07 10:51 ` [PATCH v5 00/18] btrfs: simple quotas David Sterba
2023-09-07 20:51   ` Boris Burkov
2023-09-11 18:06     ` David Sterba
2023-09-11 18:12   ` David Sterba
