* Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit
@ 2022-03-06 15:59 Jan Ziak
  2022-03-07  0:48 ` Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 71+ messages in thread
From: Jan Ziak @ 2022-03-06 15:59 UTC (permalink / raw)
  To: linux-btrfs

I would like to report that btrfs in Linux kernel 5.16.12, mounted with
the autodefrag option, wrote 5TB in a single day to a 1TB SSD that is
about 50% full. Defragmenting 0.5TB on a drive that is 50% full should
write far less than 5TB.

The benefit to the fragmentation of the most-written files over the
course of that day (sqlite database files) is nil. Please see the data
below. Also note that the sqlite file is using up to 10 GB more space
than it should due to fragmentation.

CPU utilization on an otherwise idle machine is approximately 600% all
the time: btrfs-cleaner 100%, kworkers...btrfs 500%.

I am not just asking you to fix this issue - I am asking you how it is
possible for an algorithm that is significantly worse than O(N*log(N))
to be merged into the Linux kernel in the first place!?

Please try to avoid discussing no-CoW (chattr +C) in your response,
because it is beside the point. Thanks.

----

A day before:

$ smartctl -a /dev/nvme0n1 | grep Units
Data Units Read:    449,265,485 [230 TB]
Data Units Written: 406,386,721 [208 TB]

$ compsize file.sqlite
Processed 1 file, 1757129 regular extents (2934077 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%      46G          46G          37G
none       100%      46G          46G          37G

----

A day after:

$ smartctl -a /dev/nvme0n1 | grep Units
Data Units Read:    473,211,419 [242 TB]
Data Units Written: 417,249,915 [213 TB]

$ compsize file.sqlite
Processed 1 file, 1834778 regular extents (3050838 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%      47G          47G          37G
none       100%      47G          47G          37G

$ filefrag file.sqlite
(Ctrl-C after waiting more than 10 minutes, consuming 100% CPU)

----

Manual defragmentation decreased the file's on-disk size by 7 GB:

$ btrfs-defrag file.sqlite
$ sync
$ compsize file.sqlite
Processed 6 files, 13074 regular extents (20260 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%      40G          40G          37G
none       100%      40G          40G          37G

----

Sincerely
Jan

^ permalink raw reply	[flat|nested] 71+ messages in thread
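
[Aside] smartctl reports NVMe "Data Units" in units of 1000 512-byte blocks
(512,000 bytes each), so the one-day delta quoted above can be converted to
bytes directly; a rough shell check using the figures from the report (not
part of the original message):

    $ echo "$(( (417249915 - 406386721) * 512000 / 1000**3 )) GB written in one day"
    5561 GB written in one day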
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-06 15:59 Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit Jan Ziak @ 2022-03-07 0:48 ` Qu Wenruo 2022-03-07 2:23 ` Jan Ziak 2022-03-07 14:30 ` Phillip Susi 2022-03-16 12:47 ` Kai Krakow 2 siblings, 1 reply; 71+ messages in thread From: Qu Wenruo @ 2022-03-07 0:48 UTC (permalink / raw) To: Jan Ziak, linux-btrfs On 2022/3/6 23:59, Jan Ziak wrote: > I would like to report that btrfs in Linux kernel 5.16.12 mounted with > the autodefrag option wrote 5TB in a single day to a 1TB SSD that is > about 50% full. > > Defragmenting 0.5TB on a drive that is 50% full should write far less than 5TB. If using defrag ioctl, that's a good and solid expectation. > > Benefits to the fragmentation of the most written files over the > course of the one day (sqlite database files) are nil. Please see the > data below. Also note that the sqlite file is using up to 10 GB more > than it should due to fragmentation. Autodefrag will mark any file which got smaller writes (<64K) for scan. For smaller extents than 64K, they will be re-dirtied for writeback. So in theory, if the cleaner is triggered very frequently to do autodefrag, it can indeed easily amplify the writes. Are you using commit= mount option? Which would reduce the commit interval thus trigger autodefrag more frequently. > > CPU utilization on an otherwise idle machine is approximately 600% all > the time: btrfs-cleaner 100%, kworkers...btrfs 500%. The problem is why the CPU usage is at 100% for cleaner. Would you please apply this patch on your kernel? https://patchwork.kernel.org/project/linux-btrfs/patch/bf2635d213e0c85251c4cd0391d8fbf274d7d637.1645705266.git.wqu@suse.com/ Then enable the following trace events: btrfs:defrag_one_locked_range btrfs:defrag_add_target btrfs:defrag_file_start btrfs:defrag_file_end Those trace events would show why we're doing the same re-dirty again and again, and mostly why the CPU usage is so high. Thanks, Qu > > I am not just asking you to fix this issue - I am asking you how is it > possible for an algorithm that is significantly worse than O(N*log(N)) > to be merged into the Linux kernel in the first place!? > > Please try to avoid discussing no-CoW (chattr +C) in your response, > because it is beside the point. Thanks. > > ---- > > A day before: > > $ smartctl -a /dev/nvme0n1 | grep Units > Data Units Read: 449,265,485 [230 TB] > Data Units Written: 406,386,721 [208 TB] > > $ compsize file.sqlite > Processed 1 file, 1757129 regular extents (2934077 refs), 0 inline. > Type Perc Disk Usage Uncompressed Referenced > TOTAL 100% 46G 46G 37G > none 100% 46G 46G 37G > > ---- > > A day after: > > $ smartctl -a /dev/nvme0n1 | grep Units > Data Units Read: 473,211,419 [242 TB] > Data Units Written: 417,249,915 [213 TB] > > $ compsize file.sqlite > Processed 1 file, 1834778 regular extents (3050838 refs), 0 inline. > Type Perc Disk Usage Uncompressed Referenced > TOTAL 100% 47G 47G 37G > none 100% 47G 47G 37G > > $ filefrag file.sqlite > (Ctrl-C after waiting more than 10 minutes, consuming 100% CPU) > > ---- > > Manual defragmentation decreased the file's size by 7 GB: > > $ btrfs-defrag file.sqlite > $ sync > $ compsize file.sqlite > Processed 6 files, 13074 regular extents (20260 refs), 0 inline. > Type Perc Disk Usage Uncompressed Referenced > TOTAL 100% 40G 40G 37G > none 100% 40G 40G 37G > > ---- > > Sincerely > Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
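
[Aside] A quick way to confirm which of these mount options are actually in
effect is to inspect the mounted filesystem directly (a minimal check,
assuming the filesystem in question is mounted at /; not part of the
original exchange):

    $ findmnt -no OPTIONS / | tr ',' '\n' | grep -E '^(commit=|autodefrag)'
    # expect only "autodefrag" here if no commit= override is set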
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-07 0:48 ` Qu Wenruo @ 2022-03-07 2:23 ` Jan Ziak 2022-03-07 2:39 ` Qu Wenruo 0 siblings, 1 reply; 71+ messages in thread From: Jan Ziak @ 2022-03-07 2:23 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On Mon, Mar 7, 2022 at 1:48 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > On 2022/3/6 23:59, Jan Ziak wrote: > > I would like to report that btrfs in Linux kernel 5.16.12 mounted with > > the autodefrag option wrote 5TB in a single day to a 1TB SSD that is > > about 50% full. > > > > Defragmenting 0.5TB on a drive that is 50% full should write far less than 5TB. > > If using defrag ioctl, that's a good and solid expectation. > > Autodefrag will mark any file which got smaller writes (<64K) for scan. > For smaller extents than 64K, they will be re-dirtied for writeback. The NVMe device has 512-byte sectors, but has another namespace with 4K sectors. Will it help btrfs-autodefrag to reformat the drive to 4K sectors? I expect that it won't help - I am asking just in case my expectation is wrong. > So in theory, if the cleaner is triggered very frequently to do > autodefrag, it can indeed easily amplify the writes. According to usr/bin/glances, the sqlite app is writing less than 1 MB per second to the NVMe device. btrfs's autodefrag write amplification is from the 1 MB/s to approximately 200 MB/s. > Are you using commit= mount option? Which would reduce the commit > interval thus trigger autodefrag more frequently. I am not using commit= mount option. > > CPU utilization on an otherwise idle machine is approximately 600% all > > the time: btrfs-cleaner 100%, kworkers...btrfs 500%. > > The problem is why the CPU usage is at 100% for cleaner. > > Would you please apply this patch on your kernel? > https://patchwork.kernel.org/project/linux-btrfs/patch/bf2635d213e0c85251c4cd0391d8fbf274d7d637.1645705266.git.wqu@suse.com/ > > Then enable the following trace events... I will try to apply the patch, collect the events and post the results. First, I will wait for the sqlite file to gain about 1 million extents, which shouldn't take too long. ---- BTW: "compsize file-with-million-extents" finishes in 0.2 seconds (uses BTRFS_IOC_TREE_SEARCH_V2 ioctl), but "filefrag file-with-million-extents" doesn't finish even after several minutes of time (uses FS_IOC_FIEMAP ioctl - manages to perform only about 5 ioctl syscalls per second - and appears to be slowing down as the value of the "fm_start" ioctl argument grows; e2fsprogs version 1.46.5). It would be nice if filefrag was faster than just a few ioctls per second. ---- Sincerely Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
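
[Aside] The namespace's supported LBA formats can be inspected
non-destructively with nvme-cli before deciding whether a reformat is even
worth considering (a sketch assuming nvme-cli is installed; actually
switching formats with "nvme format" erases the namespace):

    $ nvme id-ns -H /dev/nvme0n1 | grep -i 'LBA Format'
    # the entry marked "(in use)" shows the current data size (512 vs 4096 bytes)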
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit
  2022-03-07  2:23 ` Jan Ziak
@ 2022-03-07  2:39   ` Qu Wenruo
  2022-03-07  7:31     ` Qu Wenruo
  2022-03-08 21:57     ` Jan Ziak
  0 siblings, 2 replies; 71+ messages in thread
From: Qu Wenruo @ 2022-03-07  2:39 UTC (permalink / raw)
  To: Jan Ziak; +Cc: linux-btrfs

On 2022/3/7 10:23, Jan Ziak wrote:
> On Mon, Mar 7, 2022 at 1:48 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> On 2022/3/6 23:59, Jan Ziak wrote:
>>> I would like to report that btrfs in Linux kernel 5.16.12 mounted with
>>> the autodefrag option wrote 5TB in a single day to a 1TB SSD that is
>>> about 50% full.
>>>
>>> Defragmenting 0.5TB on a drive that is 50% full should write far less than 5TB.
>>
>> If using defrag ioctl, that's a good and solid expectation.
>>
>> Autodefrag will mark any file which got smaller writes (<64K) for scan.
>> For smaller extents than 64K, they will be re-dirtied for writeback.
>
> The NVMe device has 512-byte sectors, but has another namespace with
> 4K sectors. Will it help btrfs-autodefrag to reformat the drive to 4K
> sectors? I expect that it won't help - I am asking just in case my
> expectation is wrong.

The minimal sector size of btrfs is 4K, so I don't believe it would
make any difference.

>
>> So in theory, if the cleaner is triggered very frequently to do
>> autodefrag, it can indeed easily amplify the writes.
>
> According to usr/bin/glances, the sqlite app is writing less than 1 MB
> per second to the NVMe device. btrfs's autodefrag write amplification
> is from the 1 MB/s to approximately 200 MB/s.

There is definitely something wrong here.

Autodefrag by default should only get triggered every 300s, so even if
all newly written bytes are re-dirtied, it should only cause a write
burst of less than 300MB every 300s, not constant writing.

>
>> Are you using commit= mount option? Which would reduce the commit
>> interval thus trigger autodefrag more frequently.
>
> I am not using commit= mount option.
>
>>> CPU utilization on an otherwise idle machine is approximately 600% all
>>> the time: btrfs-cleaner 100%, kworkers...btrfs 500%.
>>
>> The problem is why the CPU usage is at 100% for cleaner.
>>
>> Would you please apply this patch on your kernel?
>> https://patchwork.kernel.org/project/linux-btrfs/patch/bf2635d213e0c85251c4cd0391d8fbf274d7d637.1645705266.git.wqu@suse.com/
>>
>> Then enable the following trace events...
>
> I will try to apply the patch, collect the events and post the
> results. First, I will wait for the sqlite file to gain about 1
> million extents, which shouldn't take too long.

Thank you very much in advance for the trace events log.

That will be the determining data for solving this.

>
> ----
>
> BTW: "compsize file-with-million-extents" finishes in 0.2 seconds
> (uses BTRFS_IOC_TREE_SEARCH_V2 ioctl), but "filefrag
> file-with-million-extents" doesn't finish even after several minutes
> of time (uses FS_IOC_FIEMAP ioctl - manages to perform only about 5
> ioctl syscalls per second - and appears to be slowing down as the
> value of the "fm_start" ioctl argument grows; e2fsprogs version
> 1.46.5). It would be nice if filefrag was faster than just a few
> ioctls per second.

This is mostly a race with autodefrag.

Both use the file's extent map, so if autodefrag is still trying to
re-dirty the file again and again, it will definitely cause problems
for anything else that also uses the extent map.

Thanks,
Qu

>
> ----
>
> Sincerely
> Jan

^ permalink raw reply	[flat|nested] 71+ messages in thread
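
[Aside] Putting rough numbers on the mismatch described above (taking the
~1 MB/s application write rate and the default 300 s commit interval at face
value; not from the original message):

    expected: ~1 MB/s x 300 s     = at most ~300 MB re-dirtied per autodefrag pass, as periodic bursts
    observed: ~200 MB/s sustained = ~60 GB per 300 s, i.e. roughly a 200x write amplification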
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-07 2:39 ` Qu Wenruo @ 2022-03-07 7:31 ` Qu Wenruo 2022-03-10 1:10 ` Jan Ziak 2022-03-08 21:57 ` Jan Ziak 1 sibling, 1 reply; 71+ messages in thread From: Qu Wenruo @ 2022-03-07 7:31 UTC (permalink / raw) To: Jan Ziak; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 3611 bytes --] On 2022/3/7 10:39, Qu Wenruo wrote: > > > On 2022/3/7 10:23, Jan Ziak wrote: >> On Mon, Mar 7, 2022 at 1:48 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >>> On 2022/3/6 23:59, Jan Ziak wrote: >>>> I would like to report that btrfs in Linux kernel 5.16.12 mounted with >>>> the autodefrag option wrote 5TB in a single day to a 1TB SSD that is >>>> about 50% full. >>>> >>>> Defragmenting 0.5TB on a drive that is 50% full should write far >>>> less than 5TB. >>> >>> If using defrag ioctl, that's a good and solid expectation. >>> >>> Autodefrag will mark any file which got smaller writes (<64K) for scan. >>> For smaller extents than 64K, they will be re-dirtied for writeback. >> >> The NVMe device has 512-byte sectors, but has another namespace with >> 4K sectors. Will it help btrfs-autodefrag to reformat the drive to 4K >> sectors? I expect that it won't help - I am asking just in case my >> expectation is wrong. > > The minimal sector size of btrfs is 4K, so I don't believe it would > cause any difference. > >> >>> So in theory, if the cleaner is triggered very frequently to do >>> autodefrag, it can indeed easily amplify the writes. >> >> According to usr/bin/glances, the sqlite app is writing less than 1 MB >> per second to the NVMe device. btrfs's autodefrag write amplification >> is from the 1 MB/s to approximately 200 MB/s. > > This is definitely something wrong. > > Autodefrag by default should only get triggered every 300s, thus even > all new bytes are re-dirtied, it should only cause a less than 300M > write burst every 300s, not a consistent write. > >> >>> Are you using commit= mount option? Which would reduce the commit >>> interval thus trigger autodefrag more frequently. >> >> I am not using commit= mount option. >> >>>> CPU utilization on an otherwise idle machine is approximately 600% all >>>> the time: btrfs-cleaner 100%, kworkers...btrfs 500%. >>> >>> The problem is why the CPU usage is at 100% for cleaner. >>> >>> Would you please apply this patch on your kernel? >>> https://patchwork.kernel.org/project/linux-btrfs/patch/bf2635d213e0c85251c4cd0391d8fbf274d7d637.1645705266.git.wqu@suse.com/ >>> >>> >>> Then enable the following trace events... >> >> I will try to apply the patch, collect the events and post the >> results. First, I will wait for the sqlite file to gain about 1 >> million extents, which shouldn't take too long. > > Thank you very much for the future trace events log. > > That would be the determining data for us to solve it. Forgot to mention that, that patch itself relies on refactors in the previous patches. Thus you may want to apply the whole patchset. Or use the attached diff which I manually backported for v5.16.12. Thanks, Qu > >> >> ---- >> >> BTW: "compsize file-with-million-extents" finishes in 0.2 seconds >> (uses BTRFS_IOC_TREE_SEARCH_V2 ioctl), but "filefrag >> file-with-million-extents" doesn't finish even after several minutes >> of time (uses FS_IOC_FIEMAP ioctl - manages to perform only about 5 >> ioctl syscalls per second - and appears to be slowing down as the >> value of the "fm_start" ioctl argument grows; e2fsprogs version >> 1.46.5). 
It would be nice if filefrag was faster than just a few >> ioctls per second. > > This is mostly a race with autodefrag. > > Both are using file extent map, thus if autodefrag is still trying to > redirty the file again and again, it would definitely cause problems for > anything also using file extent map. > > Thanks, > Qu >> >> ---- >> >> Sincerely >> Jan [-- Attachment #2: 0001-btrfs-add-trace-events-for-defrag.patch --] [-- Type: text/x-patch, Size: 8043 bytes --] From 757bf0aa39c44fc7c3e8e57f1c785ab6c7cffa8a Mon Sep 17 00:00:00 2001 Message-Id: <757bf0aa39c44fc7c3e8e57f1c785ab6c7cffa8a.1646638257.git.wqu@suse.com> From: Qu Wenruo <wqu@suse.com> Date: Sun, 13 Feb 2022 14:19:20 +0800 Subject: [PATCH] btrfs: add trace events for defrag This is the backport for v5.16.12, without the dependency on the btrfs_defrag_ctrl refactor. This patch will introduce the following trace events: - trace_defrag_add_target() - trace_defrag_one_locked_range() - trace_defrag_file_start() - trace_defrag_file_end() Under most cases, all of them are needed to debug policy related defrag bugs. The example output would look like this: (with TASK, CPU, TIMESTAMP and UUID skipped) defrag_file_start: <UUID>: root=5 ino=257 start=0 len=131072 extent_thresh=262144 newer_than=7 flags=0x0 compress=0 max_sectors_to_defrag=1024 defrag_add_target: <UUID>: root=5 ino=257 target_start=0 target_len=4096 found em=0 len=4096 generation=7 defrag_add_target: <UUID>: root=5 ino=257 target_start=4096 target_len=4096 found em=4096 len=4096 generation=7 ... defrag_add_target: <UUID>: root=5 ino=257 target_start=57344 target_len=4096 found em=57344 len=4096 generation=7 defrag_add_target: <UUID>: root=5 ino=257 target_start=61440 target_len=4096 found em=61440 len=4096 generation=7 defrag_add_target: <UUID>: root=5 ino=257 target_start=0 target_len=4096 found em=0 len=4096 generation=7 defrag_add_target: <UUID>: root=5 ino=257 target_start=4096 target_len=4096 found em=4096 len=4096 generation=7 ... defrag_add_target: <UUID>: root=5 ino=257 target_start=57344 target_len=4096 found em=57344 len=4096 generation=7 defrag_add_target: <UUID>: root=5 ino=257 target_start=61440 target_len=4096 found em=61440 len=4096 generation=7 defrag_one_locked_range: <UUID>: root=5 ino=257 start=0 len=65536 defrag_file_end: <UUID>: root=5 ino=257 sectors_defragged=16 last_scanned=131072 ret=0 Although the defrag_add_target() part is lengthy, it shows some details of the extent map we get. With the extra info from defrag_file_start(), we can check if the target em is correct for our defrag policy. Signed-off-by: Qu Wenruo <wqu@suse.com> --- fs/btrfs/ioctl.c | 6 ++ include/trace/events/btrfs.h | 128 +++++++++++++++++++++++++++++++++++ 2 files changed, 134 insertions(+) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 541a4fbfd79e..622d10ac3e97 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1272,6 +1272,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode, add: last_is_target = true; range_len = min(extent_map_end(em), start + len) - cur; + trace_defrag_add_target(inode, em, cur, range_len); /* * This one is a good target, check if it can be merged into * last range of the target list. 
@@ -1366,6 +1367,7 @@ static int defrag_one_locked_target(struct btrfs_inode *inode, ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start, len); if (ret < 0) return ret; + trace_defrag_one_locked_range(inode, start, (u32)len); clear_extent_bit(&inode->io_tree, start, start + len - 1, EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, 0, 0, cached_state); @@ -1591,6 +1593,9 @@ int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra, /* Align the range */ cur = round_down(range->start, fs_info->sectorsize); last_byte = round_up(last_byte, fs_info->sectorsize) - 1; + trace_defrag_file_start(BTRFS_I(inode), cur, last_byte + 1 - cur, + extent_thresh, newer_than, max_to_defrag, + range->flags, range->compress_type); /* * If we were not given a ra, allocate a readahead context. As @@ -1690,6 +1695,7 @@ int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra, BTRFS_I(inode)->defrag_compress = BTRFS_COMPRESS_NONE; btrfs_inode_unlock(inode, 0); } + trace_defrag_file_end(BTRFS_I(inode), ret, sectors_defragged, cur); return ret; } diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index 8f58fd95efc7..98eb8f4a04c6 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -2263,6 +2263,134 @@ DEFINE_EVENT(btrfs__space_info_update, update_bytes_pinned, TP_ARGS(fs_info, sinfo, old, diff) ); +TRACE_EVENT(defrag_one_locked_range, + + TP_PROTO(const struct btrfs_inode *inode, u64 start, u32 len), + + TP_ARGS(inode, start, len), + + TP_STRUCT__entry_btrfs( + __field( u64, root ) + __field( u64, ino ) + __field( u64, start ) + __field( u32, len ) + ), + + TP_fast_assign_btrfs(inode->root->fs_info, + __entry->root = inode->root->root_key.objectid; + __entry->ino = btrfs_ino(inode); + __entry->start = start; + __entry->len = len; + ), + + TP_printk_btrfs("root=%llu ino=%llu start=%llu len=%u", + __entry->root, __entry->ino, __entry->start, __entry->len) +); + +TRACE_EVENT(defrag_add_target, + + TP_PROTO(const struct btrfs_inode *inode, const struct extent_map *em, + u64 start, u32 len), + + TP_ARGS(inode, em, start, len), + + TP_STRUCT__entry_btrfs( + __field( u64, root ) + __field( u64, ino ) + __field( u64, target_start ) + __field( u32, target_len ) + __field( u64, em_generation ) + __field( u64, em_start ) + __field( u64, em_len ) + ), + + TP_fast_assign_btrfs(inode->root->fs_info, + __entry->root = inode->root->root_key.objectid; + __entry->ino = btrfs_ino(inode); + __entry->target_start = start; + __entry->target_len = len; + __entry->em_generation = em->generation; + __entry->em_start = em->start; + __entry->em_len = em->len; + ), + + TP_printk_btrfs("root=%llu ino=%llu target_start=%llu target_len=%u " + "found em=%llu len=%llu generation=%llu", + __entry->root, __entry->ino, __entry->target_start, + __entry->target_len, __entry->em_start, __entry->em_len, + __entry->em_generation) +); + +TRACE_EVENT(defrag_file_start, + + TP_PROTO(const struct btrfs_inode *inode, + u64 start, u64 len, u32 extent_thresh, u64 newer_than, + unsigned long max_sectors_to_defrag, u64 flags, u32 compress), + + TP_ARGS(inode, start, len, extent_thresh, newer_than, + max_sectors_to_defrag, flags, compress), + + TP_STRUCT__entry_btrfs( + __field( u64, root ) + __field( u64, ino ) + __field( u64, start ) + __field( u64, len ) + __field( u64, newer_than ) + __field( u64, max_sectors_to_defrag ) + __field( u32, extent_thresh ) + __field( u8, flags ) + __field( u8, compress ) + ), + + TP_fast_assign_btrfs(inode->root->fs_info, + __entry->root = 
inode->root->root_key.objectid; + __entry->ino = btrfs_ino(inode); + __entry->start = start; + __entry->len = len; + __entry->extent_thresh = extent_thresh; + __entry->newer_than = newer_than; + __entry->max_sectors_to_defrag = max_sectors_to_defrag; + __entry->flags = flags; + __entry->compress = compress; + ), + + TP_printk_btrfs("root=%llu ino=%llu start=%llu len=%llu " + "extent_thresh=%u newer_than=%llu flags=0x%x compress=%u " + "max_sectors_to_defrag=%llu", + __entry->root, __entry->ino, __entry->start, __entry->len, + __entry->extent_thresh, __entry->newer_than, __entry->flags, + __entry->compress, __entry->max_sectors_to_defrag) +); + +TRACE_EVENT(defrag_file_end, + + TP_PROTO(const struct btrfs_inode *inode, + int ret, u64 sectors_defragged, u64 last_scanned), + + TP_ARGS(inode, ret, sectors_defragged, last_scanned), + + TP_STRUCT__entry_btrfs( + __field( u64, root ) + __field( u64, ino ) + __field( u64, sectors_defragged ) + __field( u64, last_scanned ) + __field( int, ret ) + ), + + TP_fast_assign_btrfs(inode->root->fs_info, + __entry->root = inode->root->root_key.objectid; + __entry->ino = btrfs_ino(inode); + __entry->sectors_defragged = sectors_defragged; + __entry->last_scanned = last_scanned; + __entry->ret = ret; + ), + + TP_printk_btrfs("root=%llu ino=%llu sectors_defragged=%llu " + "last_scanned=%llu ret=%d", + __entry->root, __entry->ino, __entry->sectors_defragged, + __entry->last_scanned, __entry->ret) +); + #endif /* _TRACE_BTRFS_H */ /* This part must be outside protection */ -- 2.35.1 ^ permalink raw reply related [flat|nested] 71+ messages in thread
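
[Aside] One way to apply the attached backport and rebuild (a sketch only;
it assumes a vanilla v5.16.12 source tree with an existing .config, and that
the attachment was saved as 0001-btrfs-add-trace-events-for-defrag.patch):

    $ cd linux-5.16.12
    $ patch -p1 < ../0001-btrfs-add-trace-events-for-defrag.patch
    $ make olddefconfig && make -j"$(nproc)"
    $ sudo make modules_install install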
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-07 7:31 ` Qu Wenruo @ 2022-03-10 1:10 ` Jan Ziak 2022-03-10 1:26 ` Qu Wenruo 0 siblings, 1 reply; 71+ messages in thread From: Jan Ziak @ 2022-03-10 1:10 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs > Or use the attached diff which I manually backported for v5.16.12. I applied the patch to 5.16.12. It takes about 35 minutes after "mount / -o remount,autodefrag" for btrfs autodefrag to start writing about 200 MB/s to the NVMe drive. $ trace-cmd record -e btrfs:defrag_* The size of the resulting trace.dat file is 4 GB. Please send me some instructions describing how to extract data relevant to the btrfs-autodefrag issue from the trace.dat file. I suppose you don't want the whole trace.dat file. Compressed trace.dat.zstd has size 324 MB. -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
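
[Aside] The defrag events can also be pulled out of the existing 4 GB
trace.dat without re-recording (a sketch using trace-cmd report; file names
are the ones mentioned above):

    $ trace-cmd report -i trace.dat \
          | grep -E 'defrag_(file_(start|end)|add_target|one_locked_range)' \
          | zstd > defrag-events.txt.zst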
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit
  2022-03-10  1:10 ` Jan Ziak
@ 2022-03-10  1:26   ` Qu Wenruo
  2022-03-10  4:33     ` Jan Ziak
  0 siblings, 1 reply; 71+ messages in thread
From: Qu Wenruo @ 2022-03-10  1:26 UTC (permalink / raw)
  To: Jan Ziak, Qu Wenruo; +Cc: linux-btrfs

On 2022/3/10 09:10, Jan Ziak wrote:
>> Or use the attached diff which I manually backported for v5.16.12.
>
> I applied the patch to 5.16.12. It takes about 35 minutes after "mount
> / -o remount,autodefrag" for btrfs autodefrag to start writing about
> 200 MB/s to the NVMe drive.
>
> $ trace-cmd record -e btrfs:defrag_*

You can skip trace-cmd and use the tracefs interface under
/sys/kernel/debug/tracing directly (trace-cmd sometimes just
over-complicates things).

This will not only reduce the size of the file, but also produce a
directly readable result (all of the following needs root privileges):

cd /sys/kernel/debug/tracing

## To disable and clear the current trace buffer and events
echo 0 > tracing_on
echo > trace
echo > set_event

## Reduce the per-cpu buffer size in KB, if you don't want a too large
## event buffer
echo 64 > buffer_size_kb

## Enable those defrag events:
echo "btrfs:defrag_one_locked_range" >> set_event
echo "btrfs:defrag_add_target" >> set_event
echo "btrfs:defrag_file_start" >> set_event
echo "btrfs:defrag_file_end" >> set_event

## Enable tracing
echo 1 > tracing_on

## After the constant writing starts, just copy the trace file
cp /sys/kernel/debug/tracing/trace /tmp/whatevername

Thanks,
Qu

>
> The size of the resulting trace.dat file is 4 GB.
>
> Please send me some instructions describing how to extract data
> relevant to the btrfs-autodefrag issue from the trace.dat file. I
> suppose you don't want the whole trace.dat file. Compressed
> trace.dat.zstd has size 324 MB.
>
> -Jan
>

^ permalink raw reply	[flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-10 1:26 ` Qu Wenruo @ 2022-03-10 4:33 ` Jan Ziak 2022-03-10 6:42 ` Qu Wenruo 0 siblings, 1 reply; 71+ messages in thread From: Jan Ziak @ 2022-03-10 4:33 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 492 bytes --] On Thu, Mar 10, 2022 at 2:26 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > ## Enable trace > > echo 1 > $tracedir/tracing_on > > ## After the consistent write happens, just copy the trace file > > cp /sys/kernel/debug/tracing/trace /tmp/whatevername The compressed trace is attached to this email. Inode 307273 is the 40GB sqlite file, it currently has 1689020 extents. This time, it took about 3 hours after "mount / -o remount,autodefrag" for the issue to start manifesting itself. -Jan [-- Attachment #2: trace.txt.zst --] [-- Type: application/zstd, Size: 86535 bytes --] ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit
  2022-03-10  4:33 ` Jan Ziak
@ 2022-03-10  6:42   ` Qu Wenruo
  2022-03-10 21:31     ` Jan Ziak
  0 siblings, 1 reply; 71+ messages in thread
From: Qu Wenruo @ 2022-03-10  6:42 UTC (permalink / raw)
  To: Jan Ziak; +Cc: linux-btrfs

On 2022/3/10 12:33, Jan Ziak wrote:
> On Thu, Mar 10, 2022 at 2:26 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> ## Enable trace
>>
>> echo 1 > $tracedir/tracing_on
>>
>> ## After the consistent write happens, just copy the trace file
>>
>> cp /sys/kernel/debug/tracing/trace /tmp/whatevername
>
> The compressed trace is attached to this email. Inode 307273 is the
> 40GB sqlite file, it currently has 1689020 extents. This time, it took
> about 3 hours after "mount / -o remount,autodefrag" for the issue to
> start manifesting itself.

Sorry, considering your sqlite file is so large, there are too many
defrag_one_locked_range() and defrag_add_target() calls.

And the buffer size is a little too small.

Would you mind re-taking the trace with the following commands?
(No need to reboot; they take effect immediately.)

cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo > trace
echo > set_event
echo 65536 > buffer_size_kb
echo "btrfs:defrag_file_start" >> set_event
echo "btrfs:defrag_file_end" >> set_event
echo 1 > tracing_on

## After the constant writing starts, just copy the trace file
cp /sys/kernel/debug/tracing/trace /tmp/whatevername

Thanks,
Qu

>
> -Jan

^ permalink raw reply	[flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-10 6:42 ` Qu Wenruo @ 2022-03-10 21:31 ` Jan Ziak 2022-03-10 23:27 ` Qu Wenruo 0 siblings, 1 reply; 71+ messages in thread From: Jan Ziak @ 2022-03-10 21:31 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On Thu, Mar 10, 2022 at 7:42 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > The compressed trace is attached to this email. Inode 307273 is the > > 40GB sqlite file, it currently has 1689020 extents. This time, it took > > about 3 hours after "mount / -o remount,autodefrag" for the issue to > > start manifesting itself. > > Sorry, considering your sqlite file is so large, there are too many > defrag_one_locked_range() and defrag_add_target() calls. > > And the buffer size is a little too small. > > Mind to re-take the trace with the following commands? The compressed trace (size: 1.8 MB) can be downloaded from http://atom-symbol.net/f/2022-03-10/btrfs-autodefrag-trace.txt.zst According to compsize: - inode 307273, at the start of the trace: 1783756 regular extents (3045856 refs), 0 inline - inode 307273, at the end of the trace: 1787794 regular extents (3054334 refs), 0 inline - inode 307273, delta: +4038 regular extents (+8478 refs) Approximately 85% of lines in the trace are related to the mentioned inode, which means that btrfs-autodefrag is trying to defragment the file. The main issue, in my opinion, is that the number of extents increased by 4038 despite btrfs's defragmentation attempts. -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
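
[Aside] The "approximately 85%" figure can be reproduced with a simple count
over the decompressed trace (a sketch; inode number as given above, file
name assumed from the URL after decompression):

    $ grep -c 'ino=307273' btrfs-autodefrag-trace.txt
    $ wc -l < btrfs-autodefrag-trace.txt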
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit
  2022-03-10 21:31 ` Jan Ziak
@ 2022-03-10 23:27   ` Qu Wenruo
  2022-03-11  2:42     ` Jan Ziak
  0 siblings, 1 reply; 71+ messages in thread
From: Qu Wenruo @ 2022-03-10 23:27 UTC (permalink / raw)
  To: Jan Ziak; +Cc: linux-btrfs

On 2022/3/11 05:31, Jan Ziak wrote:
> On Thu, Mar 10, 2022 at 7:42 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>> The compressed trace is attached to this email. Inode 307273 is the
>>> 40GB sqlite file, it currently has 1689020 extents. This time, it took
>>> about 3 hours after "mount / -o remount,autodefrag" for the issue to
>>> start manifesting itself.
>>
>> Sorry, considering your sqlite file is so large, there are too many
>> defrag_one_locked_range() and defrag_add_target() calls.
>>
>> And the buffer size is a little too small.
>>
>> Mind to re-take the trace with the following commands?
>
> The compressed trace (size: 1.8 MB) can be downloaded from
> http://atom-symbol.net/f/2022-03-10/btrfs-autodefrag-trace.txt.zst
>
> According to compsize:
>
> - inode 307273, at the start of the trace: 1783756 regular extents
>   (3045856 refs), 0 inline
>
> - inode 307273, at the end of the trace: 1787794 regular extents
>   (3054334 refs), 0 inline
>
> - inode 307273, delta: +4038 regular extents (+8478 refs)

The trace results show a pattern at the beginning: roughly every 30s,
autodefrag scans that inode once:

67292.784930: defrag_file_start: root=5 ino=307273 start=0 len=42705735680 extent_thresh=65536
67323.655798: defrag_file_start: root=5 ino=307273 start=0 len=42706268160 extent_thresh=65536
67354.126797: defrag_file_start: root=5 ino=307273 start=0 len=42706268160 extent_thresh=65536
67358.865643: defrag_file_start: root=5 ino=307273 start=0 len=42706268160 extent_thresh=65536
67385.190417: defrag_file_start: root=5 ino=307273 start=0 len=42706554880 extent_thresh=65536
67415.960153: defrag_file_start: root=5 ino=307273 start=0 len=42706554880 extent_thresh=65536
67446.798930: defrag_file_start: root=5 ino=307273 start=0 len=42707038208 extent_thresh=65536

This part is the expected behavior.

But very soon, autodefrag starts scanning the file again and again
within a very short time:

69188.802624: defrag_file_start: root=5 ino=307273 start=0 len=42720563200 extent_thresh=65536
69189.235753: defrag_file_start: root=5 ino=307273 start=0 len=42720563200 extent_thresh=65536
69189.896309: defrag_file_start: root=5 ino=307273 start=0 len=42720563200 extent_thresh=65536
69190.594834: defrag_file_start: root=5 ino=307273 start=0 len=42720563200 extent_thresh=65536
69191.185359: defrag_file_start: root=5 ino=307273 start=0 len=42720563200 extent_thresh=65536
69191.543833: defrag_file_start: root=5 ino=307273 start=0 len=42720563200 extent_thresh=65536
69192.275865: defrag_file_start: root=5 ino=307273 start=0 len=42720563200 extent_thresh=65536

That inode gets defragged 7 times in just 5 seconds, and there are more
similar patterns for the same inode.

This unexpected behavior is the same as reported by another reporter
(https://github.com/btrfs/linux/issues/423#issuecomment-1062338536).

Thus this patch should resolve the repeated-defrag behavior:
https://patchwork.kernel.org/project/linux-btrfs/patch/318a1bcdabdd1218d631ddb1a6fe1b9ca3b6b529.1646782687.git.wqu@suse.com/

Would you mind giving it a try?

>
> Approximately 85% of lines in the trace are related to the mentioned
> inode, which means that btrfs-autodefrag is trying to defragment the
> file. The main issue, in my opinion, is that the number of extents
> increased by 4038 despite btrfs's defragmentation attempts.

Well, this is a trade-off between the effectiveness of defrag and IO.

Previously we had a larger extent threshold for autodefrag (256K vs.
64K now); however, that larger threshold would cause even more IO.

In the near future (hopefully v5.19), we will introduce more
fine-tuning for autodefrag (allowing users to specify the autodefrag
interval and the target extent threshold).

Thanks,
Qu

>
> -Jan

^ permalink raw reply	[flat|nested] 71+ messages in thread
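
[Aside] The scan cadence described above can be extracted from the trace
with a one-liner (a sketch; it assumes the trimmed line format quoted above,
with the timestamp as the first field):

    $ grep 'defrag_file_start.*ino=307273' btrfs-autodefrag-trace.txt \
          | awk '{ t = $1 + 0; if (last) print t - last; last = t }'
    # prints the gap in seconds between successive scans of the inode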
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-10 23:27 ` Qu Wenruo @ 2022-03-11 2:42 ` Jan Ziak 2022-03-11 2:59 ` Qu Wenruo 0 siblings, 1 reply; 71+ messages in thread From: Jan Ziak @ 2022-03-11 2:42 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On Fri, Mar 11, 2022 at 12:27 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > The unexpected behavior is the same reported by another reporter. > (https://github.com/btrfs/linux/issues/423#issuecomment-1062338536) > > Thus this patch should resolve the repeated defrag behavior: > https://patchwork.kernel.org/project/linux-btrfs/patch/318a1bcdabdd1218d631ddb1a6fe1b9ca3b6b529.1646782687.git.wqu@suse.com/ > > Mind to give it a try? New trace (patched kernel): http://atom-symbol.net/f/2022-03-11/btrfs-autodefrag-trace-patch1.txt.zst $ cat /proc/297/io read_bytes: 217_835_884_544 write_bytes: 319_139_635_200 btrfs-cleaner (pid 297) read 217 GB and wrote 319 GB, but this had no effect on the fragmentation of the file (currently 1810562 extents). The CPU time of btrfs-cleaner is 20m22s. Machine uptime is 3h27m. -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
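
[Aside] For scale, taking the ~1 MB/s application write rate reported
earlier at face value (rough arithmetic, not from the original message):

    3h27m uptime = 12420 s; 12420 s * ~1 MB/s = at most ~12.4 GB of application writes
    319 GB written by btrfs-cleaner / ~12.4 GB = roughly 26x write amplification from the cleaner alone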
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit
  2022-03-11  2:42 ` Jan Ziak
@ 2022-03-11  2:59   ` Qu Wenruo
  2022-03-11  5:04     ` Jan Ziak
  2022-03-14 20:09     ` Phillip Susi
  0 siblings, 2 replies; 71+ messages in thread
From: Qu Wenruo @ 2022-03-11  2:59 UTC (permalink / raw)
  To: Jan Ziak; +Cc: linux-btrfs

On 2022/3/11 10:42, Jan Ziak wrote:
> On Fri, Mar 11, 2022 at 12:27 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> The unexpected behavior is the same reported by another reporter.
>> (https://github.com/btrfs/linux/issues/423#issuecomment-1062338536)
>>
>> Thus this patch should resolve the repeated defrag behavior:
>> https://patchwork.kernel.org/project/linux-btrfs/patch/318a1bcdabdd1218d631ddb1a6fe1b9ca3b6b529.1646782687.git.wqu@suse.com/
>>
>> Mind to give it a try?
>
> New trace (patched kernel):
> http://atom-symbol.net/f/2022-03-11/btrfs-autodefrag-trace-patch1.txt.zst

Mostly as expected now.

A few outliers can also be fixed by an upcoming patch:
https://patchwork.kernel.org/project/linux-btrfs/patch/d1ce90f37777987732b8ccf0edbfc961cd5c8873.1646912061.git.wqu@suse.com/

But please note that the extra patch won't have as big an impact as the
previous one; it's mostly a small optimization.

>
> $ cat /proc/297/io
> read_bytes: 217_835_884_544
> write_bytes: 319_139_635_200
>
> btrfs-cleaner (pid 297) read 217 GB and wrote 319 GB, but this had no
> effect on the fragmentation of the file (currently 1810562 extents).

That's more or less expected. Autodefrag has two limitations:

1. It only defrags newer writes.

   It doesn't defrag older fragments. This is the existing behavior
   from the beginning of autodefrag, thus it's not that effective
   against small random writes.

2. Small target extent size.

   It only targets writes smaller than 64K.

If 1. is the main reason, then even if we allow users to specify the
autodefrag extent size/interval, it won't help this workload much.

And I have already submitted a patch to the btrfs docs, explaining that
autodefrag is not really a good fit for heavy small random writes.

Thanks,
Qu

>
> The CPU time of btrfs-cleaner is 20m22s. Machine uptime is 3h27m.
>
> -Jan

^ permalink raw reply	[flat|nested] 71+ messages in thread
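
[Aside] Given that the manual defragmentation earlier in the thread brought
the file from ~1.8 million extents down to ~13 thousand, one workaround for
this kind of workload (not something suggested in the thread itself) is to
leave autodefrag off and periodically defragment just the hot file, e.g.
from a cron job or systemd timer; -t sets the target extent size:

    $ btrfs filesystem defragment -t 32M /path/to/file.sqlite

Note that on filesystems with snapshots or reflinked copies, defragmenting
can un-share extents and therefore increase space usage.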
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-11 2:59 ` Qu Wenruo @ 2022-03-11 5:04 ` Jan Ziak 2022-03-11 16:31 ` Jan Ziak 2022-03-14 20:09 ` Phillip Susi 1 sibling, 1 reply; 71+ messages in thread From: Jan Ziak @ 2022-03-11 5:04 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On Fri, Mar 11, 2022 at 3:59 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > A few outliners can also be fixed by a upcoming patch: > https://patchwork.kernel.org/project/linux-btrfs/patch/d1ce90f37777987732b8ccf0edbfc961cd5c8873.1646912061.git.wqu@suse.com/ > > But please note that, the extra patch won't bring a bigger impact as the > previous one, it's mostly a small optimization. I will apply and test the patch and report results. -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-11 5:04 ` Jan Ziak @ 2022-03-11 16:31 ` Jan Ziak 2022-03-11 20:02 ` Jan Ziak 2022-03-11 23:04 ` Qu Wenruo 0 siblings, 2 replies; 71+ messages in thread From: Jan Ziak @ 2022-03-11 16:31 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On Fri, Mar 11, 2022 at 6:04 AM Jan Ziak <0xe2.0x9a.0x9b@gmail.com> wrote: > > On Fri, Mar 11, 2022 at 3:59 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > A few outliners can also be fixed by a upcoming patch: > > https://patchwork.kernel.org/project/linux-btrfs/patch/d1ce90f37777987732b8ccf0edbfc961cd5c8873.1646912061.git.wqu@suse.com/ > > > > But please note that, the extra patch won't bring a bigger impact as the > > previous one, it's mostly a small optimization. > > I will apply and test the patch and report results. $ uptime 10h54m CPU time of pid 297: 1h48m $ cat /proc/297/io (pid 297 is btrfs-cleaner) read_bytes: 4_433_081_716_736 write_bytes: 788_509_859_840 file.sqlite, before 10h54m: 1827586 extents file.sqlite, after 10h54m: 1876144 extents Summary: File fragmentation increased by 48558 extents despite the fact that btrfs-cleaner read 4.4 TB, wrote 788 GB and consumed 1h48m of CPU time. If it helps, I can send you the complete list of all the 1.8 million extents. I am not sure how long it might take to obtain such a list. -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-11 16:31 ` Jan Ziak @ 2022-03-11 20:02 ` Jan Ziak 2022-03-11 23:04 ` Qu Wenruo 1 sibling, 0 replies; 71+ messages in thread From: Jan Ziak @ 2022-03-11 20:02 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 877 bytes --] On Fri, Mar 11, 2022 at 5:31 PM Jan Ziak <0xe2.0x9a.0x9b@gmail.com> wrote: > If it helps, I can send you the complete list of all the 1.8 million > extents. I am not sure how long it might take to obtain such a list. It takes only a couple of minutes to get all extents by using xfs_io. A text file containing the histogram of all the file's extents is attached to this email. I suggest we do the following steps: 1. Make a snapshot of the file's extents (extents1.txt) 2. Enable btrfs autodefrag, enable tracing 3.1. Make a snapshot of the file's extents (extents2.txt) 3.2. Save /sys/kernel/debug/tracing/trace (trace.txt) 4. Compute the difference between extents1.txt and extents2.txt 5. Compare extents-diff.txt and trace.txt, in order to determine why the number of extents is increasing over time despite btrfs-autodefrag's attempts to defragment the file -Jan [-- Attachment #2: extents-histogram.txt --] [-- Type: text/plain, Size: 22650 bytes --] xfs_io -c "fiemap -v 0g 100g": 3194386 extents 1 block = 512 bytes 8 blocks: 2216051 instances 16 blocks: 302377 instances 24 blocks: 189380 instances 32 blocks: 117976 instances 40 blocks: 80658 instances 48 blocks: 58107 instances 56 blocks: 42171 instances 64 blocks: 31595 instances 72 blocks: 24612 instances 80 blocks: 19104 instances 88 blocks: 15051 instances 96 blocks: 12393 instances 104 blocks: 10114 instances 112 blocks: 8555 instances 120 blocks: 7147 instances 128 blocks: 5910 instances 136 blocks: 4991 instances 144 blocks: 4187 instances 152 blocks: 3488 instances 160 blocks: 2928 instances 168 blocks: 2499 instances 176 blocks: 2184 instances 184 blocks: 1816 instances 192 blocks: 1611 instances 200 blocks: 1416 instances 208 blocks: 1242 instances 216 blocks: 1079 instances 224 blocks: 938 instances 232 blocks: 809 instances 240 blocks: 766 instances 248 blocks: 669 instances 256 blocks: 614 instances 264 blocks: 571 instances 272 blocks: 528 instances 280 blocks: 494 instances 288 blocks: 423 instances 296 blocks: 383 instances 304 blocks: 359 instances 312 blocks: 365 instances 320 blocks: 310 instances 328 blocks: 306 instances 336 blocks: 241 instances 344 blocks: 271 instances 352 blocks: 260 instances 360 blocks: 289 instances 368 blocks: 278 instances 376 blocks: 244 instances 384 blocks: 232 instances 392 blocks: 262 instances 400 blocks: 202 instances 408 blocks: 212 instances 416 blocks: 218 instances 424 blocks: 215 instances 432 blocks: 192 instances 440 blocks: 209 instances 448 blocks: 198 instances 456 blocks: 195 instances 464 blocks: 176 instances 472 blocks: 206 instances 480 blocks: 172 instances 488 blocks: 178 instances 496 blocks: 177 instances 504 blocks: 158 instances 512 blocks: 181 instances 520 blocks: 174 instances 528 blocks: 152 instances 536 blocks: 146 instances 544 blocks: 146 instances 552 blocks: 151 instances 560 blocks: 145 instances 568 blocks: 158 instances 576 blocks: 136 instances 584 blocks: 133 instances 592 blocks: 140 instances 600 blocks: 131 instances 608 blocks: 123 instances 616 blocks: 122 instances 624 blocks: 113 instances 632 blocks: 133 instances 640 blocks: 118 instances 648 blocks: 139 instances 656 blocks: 137 instances 
664 blocks: 128 instances 672 blocks: 119 instances 680 blocks: 101 instances 688 blocks: 98 instances 696 blocks: 116 instances 704 blocks: 101 instances 712 blocks: 119 instances 720 blocks: 106 instances 728 blocks: 107 instances 736 blocks: 95 instances 744 blocks: 118 instances 752 blocks: 97 instances 760 blocks: 110 instances 768 blocks: 109 instances 776 blocks: 89 instances 784 blocks: 96 instances 792 blocks: 79 instances 800 blocks: 90 instances 808 blocks: 81 instances 816 blocks: 88 instances 824 blocks: 87 instances 832 blocks: 86 instances 840 blocks: 81 instances 848 blocks: 93 instances 856 blocks: 90 instances 864 blocks: 85 instances 872 blocks: 75 instances 880 blocks: 79 instances 888 blocks: 92 instances 896 blocks: 92 instances 904 blocks: 73 instances 912 blocks: 87 instances 920 blocks: 89 instances 928 blocks: 93 instances 936 blocks: 90 instances 944 blocks: 85 instances 952 blocks: 93 instances 960 blocks: 62 instances 968 blocks: 66 instances 976 blocks: 65 instances 984 blocks: 86 instances 992 blocks: 76 instances 1000 blocks: 77 instances 1008 blocks: 62 instances 1016 blocks: 66 instances 1024 blocks: 74 instances 1032 blocks: 85 instances 1040 blocks: 70 instances 1048 blocks: 56 instances 1056 blocks: 84 instances 1064 blocks: 64 instances 1072 blocks: 64 instances 1080 blocks: 72 instances 1088 blocks: 53 instances 1096 blocks: 59 instances 1104 blocks: 50 instances 1112 blocks: 46 instances 1120 blocks: 49 instances 1128 blocks: 68 instances 1136 blocks: 59 instances 1144 blocks: 57 instances 1152 blocks: 53 instances 1160 blocks: 52 instances 1168 blocks: 46 instances 1176 blocks: 55 instances 1184 blocks: 50 instances 1192 blocks: 60 instances 1200 blocks: 47 instances 1208 blocks: 38 instances 1216 blocks: 61 instances 1224 blocks: 56 instances 1232 blocks: 66 instances 1240 blocks: 57 instances 1248 blocks: 45 instances 1256 blocks: 56 instances 1264 blocks: 47 instances 1272 blocks: 46 instances 1280 blocks: 50 instances 1288 blocks: 45 instances 1296 blocks: 44 instances 1304 blocks: 53 instances 1312 blocks: 45 instances 1320 blocks: 61 instances 1328 blocks: 42 instances 1336 blocks: 46 instances 1344 blocks: 52 instances 1352 blocks: 50 instances 1360 blocks: 46 instances 1368 blocks: 39 instances 1376 blocks: 53 instances 1384 blocks: 41 instances 1392 blocks: 55 instances 1400 blocks: 33 instances 1408 blocks: 46 instances 1416 blocks: 29 instances 1424 blocks: 34 instances 1432 blocks: 41 instances 1440 blocks: 30 instances 1448 blocks: 38 instances 1456 blocks: 29 instances 1464 blocks: 34 instances 1472 blocks: 34 instances 1480 blocks: 31 instances 1488 blocks: 38 instances 1496 blocks: 34 instances 1504 blocks: 41 instances 1512 blocks: 34 instances 1520 blocks: 25 instances 1528 blocks: 24 instances 1536 blocks: 32 instances 1544 blocks: 32 instances 1552 blocks: 24 instances 1560 blocks: 27 instances 1568 blocks: 29 instances 1576 blocks: 29 instances 1584 blocks: 24 instances 1592 blocks: 26 instances 1600 blocks: 33 instances 1608 blocks: 29 instances 1616 blocks: 32 instances 1624 blocks: 33 instances 1632 blocks: 32 instances 1640 blocks: 24 instances 1648 blocks: 25 instances 1656 blocks: 20 instances 1664 blocks: 23 instances 1672 blocks: 28 instances 1680 blocks: 20 instances 1688 blocks: 30 instances 1696 blocks: 27 instances 1704 blocks: 16 instances 1712 blocks: 31 instances 1720 blocks: 20 instances 1728 blocks: 19 instances 1736 blocks: 20 instances 1744 blocks: 22 instances 1752 blocks: 19 instances 1760 blocks: 24 
instances 1768 blocks: 23 instances 1776 blocks: 23 instances 1784 blocks: 25 instances 1792 blocks: 21 instances 1800 blocks: 17 instances 1808 blocks: 21 instances 1816 blocks: 24 instances 1824 blocks: 14 instances 1832 blocks: 11 instances 1840 blocks: 21 instances 1848 blocks: 19 instances 1856 blocks: 18 instances 1864 blocks: 9 instances 1872 blocks: 19 instances 1880 blocks: 16 instances 1888 blocks: 19 instances 1896 blocks: 15 instances 1904 blocks: 22 instances 1912 blocks: 18 instances 1920 blocks: 14 instances 1928 blocks: 10 instances 1936 blocks: 15 instances 1944 blocks: 17 instances 1952 blocks: 19 instances 1960 blocks: 16 instances 1968 blocks: 20 instances 1976 blocks: 23 instances 1984 blocks: 19 instances 1992 blocks: 19 instances 2000 blocks: 17 instances 2008 blocks: 15 instances 2016 blocks: 14 instances 2024 blocks: 15 instances 2032 blocks: 25 instances 2040 blocks: 17 instances 2048 blocks: 15 instances 2056 blocks: 10 instances 2064 blocks: 13 instances 2072 blocks: 29 instances 2080 blocks: 16 instances 2088 blocks: 14 instances 2096 blocks: 17 instances 2104 blocks: 12 instances 2112 blocks: 13 instances 2120 blocks: 17 instances 2128 blocks: 19 instances 2136 blocks: 14 instances 2144 blocks: 12 instances 2152 blocks: 9 instances 2160 blocks: 7 instances 2168 blocks: 8 instances 2176 blocks: 17 instances 2184 blocks: 10 instances 2192 blocks: 13 instances 2200 blocks: 14 instances 2208 blocks: 17 instances 2216 blocks: 14 instances 2224 blocks: 14 instances 2232 blocks: 11 instances 2240 blocks: 14 instances 2248 blocks: 10 instances 2256 blocks: 11 instances 2264 blocks: 14 instances 2272 blocks: 20 instances 2280 blocks: 10 instances 2288 blocks: 7 instances 2296 blocks: 13 instances 2304 blocks: 9 instances 2312 blocks: 13 instances 2320 blocks: 9 instances 2328 blocks: 6 instances 2336 blocks: 10 instances 2344 blocks: 11 instances 2352 blocks: 11 instances 2360 blocks: 11 instances 2368 blocks: 6 instances 2376 blocks: 17 instances 2384 blocks: 10 instances 2392 blocks: 7 instances 2400 blocks: 10 instances 2408 blocks: 5 instances 2416 blocks: 9 instances 2424 blocks: 9 instances 2432 blocks: 11 instances 2440 blocks: 14 instances 2448 blocks: 12 instances 2456 blocks: 16 instances 2464 blocks: 10 instances 2472 blocks: 9 instances 2480 blocks: 7 instances 2488 blocks: 6 instances 2496 blocks: 11 instances 2504 blocks: 13 instances 2512 blocks: 9 instances 2520 blocks: 8 instances 2528 blocks: 6 instances 2536 blocks: 14 instances 2544 blocks: 7 instances 2552 blocks: 9 instances 2560 blocks: 10 instances 2568 blocks: 11 instances 2576 blocks: 7 instances 2584 blocks: 10 instances 2592 blocks: 14 instances 2600 blocks: 15 instances 2608 blocks: 11 instances 2616 blocks: 8 instances 2624 blocks: 6 instances 2632 blocks: 12 instances 2640 blocks: 12 instances 2648 blocks: 5 instances 2656 blocks: 10 instances 2664 blocks: 7 instances 2672 blocks: 9 instances 2680 blocks: 7 instances 2688 blocks: 7 instances 2696 blocks: 9 instances 2704 blocks: 8 instances 2712 blocks: 8 instances 2720 blocks: 9 instances 2728 blocks: 11 instances 2736 blocks: 7 instances 2744 blocks: 5 instances 2752 blocks: 11 instances 2760 blocks: 10 instances 2768 blocks: 4 instances 2776 blocks: 9 instances 2784 blocks: 6 instances 2792 blocks: 6 instances 2800 blocks: 9 instances 2808 blocks: 10 instances 2816 blocks: 9 instances 2824 blocks: 7 instances 2832 blocks: 6 instances 2840 blocks: 12 instances 2848 blocks: 4 instances 2856 blocks: 7 instances 2864 blocks: 7 instances 
2872 blocks: 7 instances 2880 blocks: 3 instances 2888 blocks: 6 instances 2896 blocks: 9 instances 2904 blocks: 8 instances 2912 blocks: 3 instances 2920 blocks: 5 instances 2928 blocks: 9 instances 2936 blocks: 6 instances 2944 blocks: 4 instances 2952 blocks: 7 instances 2960 blocks: 3 instances 2968 blocks: 3 instances 2976 blocks: 2 instances 2984 blocks: 3 instances 2992 blocks: 10 instances 3000 blocks: 4 instances 3008 blocks: 4 instances 3016 blocks: 3 instances 3024 blocks: 3 instances 3032 blocks: 1 instances 3040 blocks: 6 instances 3048 blocks: 4 instances 3056 blocks: 10 instances 3064 blocks: 3 instances 3072 blocks: 2 instances 3080 blocks: 8 instances 3088 blocks: 5 instances 3096 blocks: 7 instances 3104 blocks: 2 instances 3112 blocks: 6 instances 3120 blocks: 3 instances 3128 blocks: 6 instances 3136 blocks: 5 instances 3144 blocks: 3 instances 3152 blocks: 2 instances 3160 blocks: 3 instances 3168 blocks: 4 instances 3176 blocks: 6 instances 3184 blocks: 5 instances 3192 blocks: 8 instances 3200 blocks: 3 instances 3208 blocks: 6 instances 3216 blocks: 5 instances 3224 blocks: 3 instances 3232 blocks: 7 instances 3240 blocks: 2 instances 3248 blocks: 5 instances 3256 blocks: 4 instances 3264 blocks: 4 instances 3272 blocks: 3 instances 3288 blocks: 3 instances 3296 blocks: 5 instances 3304 blocks: 6 instances 3312 blocks: 3 instances 3320 blocks: 8 instances 3328 blocks: 1 instances 3336 blocks: 4 instances 3344 blocks: 4 instances 3352 blocks: 6 instances 3360 blocks: 1 instances 3368 blocks: 3 instances 3376 blocks: 3 instances 3384 blocks: 5 instances 3392 blocks: 2 instances 3400 blocks: 1 instances 3408 blocks: 5 instances 3416 blocks: 6 instances 3424 blocks: 9 instances 3432 blocks: 1 instances 3440 blocks: 1 instances 3448 blocks: 3 instances 3456 blocks: 2 instances 3464 blocks: 4 instances 3472 blocks: 2 instances 3480 blocks: 5 instances 3488 blocks: 2 instances 3496 blocks: 4 instances 3504 blocks: 10 instances 3512 blocks: 2 instances 3520 blocks: 3 instances 3528 blocks: 3 instances 3536 blocks: 1 instances 3544 blocks: 2 instances 3552 blocks: 3 instances 3560 blocks: 2 instances 3568 blocks: 4 instances 3576 blocks: 3 instances 3584 blocks: 3 instances 3592 blocks: 5 instances 3600 blocks: 5 instances 3608 blocks: 3 instances 3616 blocks: 3 instances 3624 blocks: 4 instances 3632 blocks: 4 instances 3640 blocks: 5 instances 3648 blocks: 2 instances 3656 blocks: 4 instances 3664 blocks: 4 instances 3672 blocks: 2 instances 3680 blocks: 4 instances 3688 blocks: 2 instances 3696 blocks: 2 instances 3704 blocks: 1 instances 3712 blocks: 2 instances 3720 blocks: 1 instances 3728 blocks: 3 instances 3736 blocks: 3 instances 3744 blocks: 2 instances 3760 blocks: 6 instances 3768 blocks: 1 instances 3776 blocks: 2 instances 3784 blocks: 1 instances 3792 blocks: 2 instances 3800 blocks: 2 instances 3808 blocks: 3 instances 3816 blocks: 2 instances 3824 blocks: 1 instances 3832 blocks: 3 instances 3840 blocks: 4 instances 3848 blocks: 3 instances 3856 blocks: 2 instances 3864 blocks: 5 instances 3872 blocks: 5 instances 3880 blocks: 3 instances 3888 blocks: 1 instances 3896 blocks: 1 instances 3912 blocks: 2 instances 3920 blocks: 2 instances 3944 blocks: 2 instances 3952 blocks: 3 instances 3968 blocks: 3 instances 3976 blocks: 3 instances 3984 blocks: 2 instances 3992 blocks: 2 instances 4008 blocks: 3 instances 4016 blocks: 1 instances 4024 blocks: 1 instances 4032 blocks: 1 instances 4040 blocks: 1 instances 4048 blocks: 2 instances 4064 blocks: 4 instances 
4072 blocks: 4 instances 4088 blocks: 4 instances 4096 blocks: 2 instances 4104 blocks: 4 instances 4112 blocks: 4 instances 4120 blocks: 1 instances 4128 blocks: 1 instances 4136 blocks: 2 instances 4144 blocks: 3 instances 4152 blocks: 1 instances 4160 blocks: 3 instances 4168 blocks: 5 instances 4176 blocks: 1 instances 4184 blocks: 1 instances 4192 blocks: 2 instances 4200 blocks: 1 instances 4208 blocks: 5 instances 4224 blocks: 2 instances 4232 blocks: 2 instances 4240 blocks: 1 instances 4248 blocks: 2 instances 4256 blocks: 6 instances 4272 blocks: 4 instances 4280 blocks: 2 instances 4304 blocks: 3 instances 4312 blocks: 3 instances 4320 blocks: 1 instances 4328 blocks: 2 instances 4336 blocks: 2 instances 4344 blocks: 2 instances 4352 blocks: 1 instances 4360 blocks: 1 instances 4368 blocks: 4 instances 4376 blocks: 1 instances 4392 blocks: 2 instances 4400 blocks: 3 instances 4408 blocks: 2 instances 4416 blocks: 2 instances 4424 blocks: 2 instances 4432 blocks: 1 instances 4440 blocks: 2 instances 4448 blocks: 1 instances 4456 blocks: 1 instances 4464 blocks: 1 instances 4480 blocks: 2 instances 4488 blocks: 2 instances 4496 blocks: 2 instances 4504 blocks: 3 instances 4512 blocks: 3 instances 4520 blocks: 1 instances 4536 blocks: 1 instances 4544 blocks: 4 instances 4552 blocks: 1 instances 4560 blocks: 2 instances 4568 blocks: 2 instances 4584 blocks: 2 instances 4592 blocks: 2 instances 4608 blocks: 2 instances 4624 blocks: 2 instances 4632 blocks: 1 instances 4640 blocks: 2 instances 4656 blocks: 1 instances 4664 blocks: 1 instances 4672 blocks: 1 instances 4680 blocks: 1 instances 4688 blocks: 2 instances 4696 blocks: 1 instances 4712 blocks: 4 instances 4728 blocks: 1 instances 4736 blocks: 1 instances 4744 blocks: 1 instances 4760 blocks: 1 instances 4768 blocks: 1 instances 4776 blocks: 1 instances 4784 blocks: 2 instances 4792 blocks: 4 instances 4800 blocks: 1 instances 4808 blocks: 2 instances 4824 blocks: 3 instances 4832 blocks: 1 instances 4840 blocks: 2 instances 4848 blocks: 2 instances 4856 blocks: 2 instances 4864 blocks: 2 instances 4880 blocks: 1 instances 4896 blocks: 1 instances 4904 blocks: 1 instances 4912 blocks: 2 instances 4944 blocks: 5 instances 4952 blocks: 1 instances 4960 blocks: 2 instances 4968 blocks: 1 instances 4984 blocks: 1 instances 5000 blocks: 1 instances 5008 blocks: 2 instances 5016 blocks: 1 instances 5032 blocks: 1 instances 5040 blocks: 1 instances 5088 blocks: 2 instances 5112 blocks: 1 instances 5120 blocks: 2 instances 5128 blocks: 1 instances 5136 blocks: 3 instances 5152 blocks: 3 instances 5160 blocks: 2 instances 5176 blocks: 1 instances 5184 blocks: 2 instances 5192 blocks: 1 instances 5200 blocks: 2 instances 5208 blocks: 1 instances 5216 blocks: 1 instances 5224 blocks: 2 instances 5232 blocks: 2 instances 5256 blocks: 1 instances 5280 blocks: 1 instances 5288 blocks: 2 instances 5296 blocks: 4 instances 5304 blocks: 1 instances 5320 blocks: 2 instances 5328 blocks: 1 instances 5336 blocks: 1 instances 5344 blocks: 1 instances 5352 blocks: 1 instances 5360 blocks: 1 instances 5376 blocks: 1 instances 5384 blocks: 2 instances 5400 blocks: 1 instances 5408 blocks: 1 instances 5424 blocks: 1 instances 5432 blocks: 4 instances 5456 blocks: 1 instances 5464 blocks: 3 instances 5480 blocks: 1 instances 5504 blocks: 1 instances 5512 blocks: 2 instances 5520 blocks: 1 instances 5536 blocks: 2 instances 5544 blocks: 1 instances 5552 blocks: 1 instances 5576 blocks: 1 instances 5592 blocks: 1 instances 5632 blocks: 1 instances 5640 
blocks: 1 instances 5656 blocks: 1 instances 5672 blocks: 1 instances 5680 blocks: 1 instances 5688 blocks: 1 instances 5696 blocks: 1 instances 5736 blocks: 1 instances 5744 blocks: 2 instances 5760 blocks: 1 instances 5768 blocks: 1 instances 5776 blocks: 2 instances 5784 blocks: 2 instances 5792 blocks: 2 instances 5800 blocks: 1 instances 5816 blocks: 1 instances 5888 blocks: 1 instances 5896 blocks: 1 instances 5912 blocks: 2 instances 5920 blocks: 1 instances 5944 blocks: 1 instances 5952 blocks: 1 instances 5992 blocks: 1 instances 6016 blocks: 1 instances 6072 blocks: 2 instances 6088 blocks: 2 instances 6104 blocks: 1 instances 6144 blocks: 2 instances 6152 blocks: 1 instances 6160 blocks: 1 instances 6176 blocks: 1 instances 6200 blocks: 1 instances 6216 blocks: 1 instances 6224 blocks: 1 instances 6232 blocks: 1 instances 6240 blocks: 2 instances 6256 blocks: 2 instances 6264 blocks: 2 instances 6272 blocks: 1 instances 6280 blocks: 1 instances 6296 blocks: 1 instances 6320 blocks: 1 instances 6344 blocks: 1 instances 6392 blocks: 1 instances 6400 blocks: 1 instances 6408 blocks: 3 instances 6416 blocks: 2 instances 6440 blocks: 1 instances 6448 blocks: 2 instances 6480 blocks: 1 instances 6504 blocks: 2 instances 6520 blocks: 1 instances 6584 blocks: 1 instances 6616 blocks: 1 instances 6624 blocks: 1 instances 6640 blocks: 1 instances 6656 blocks: 1 instances 6672 blocks: 1 instances 6704 blocks: 1 instances 6712 blocks: 1 instances 6736 blocks: 2 instances 6768 blocks: 1 instances 6824 blocks: 1 instances 6864 blocks: 1 instances 6904 blocks: 1 instances 6912 blocks: 1 instances 6944 blocks: 2 instances 6952 blocks: 1 instances 6968 blocks: 1 instances 7032 blocks: 1 instances 7072 blocks: 1 instances 7080 blocks: 2 instances 7104 blocks: 1 instances 7144 blocks: 2 instances 7160 blocks: 1 instances 7168 blocks: 1 instances 7176 blocks: 1 instances 7184 blocks: 1 instances 7200 blocks: 1 instances 7216 blocks: 1 instances 7248 blocks: 1 instances 7272 blocks: 1 instances 7352 blocks: 1 instances 7376 blocks: 2 instances 7408 blocks: 1 instances 7416 blocks: 1 instances 7552 blocks: 2 instances 7576 blocks: 1 instances 7592 blocks: 1 instances 7608 blocks: 1 instances 7648 blocks: 1 instances 7680 blocks: 1 instances 7712 blocks: 1 instances 7784 blocks: 2 instances 7856 blocks: 1 instances 7864 blocks: 1 instances 7896 blocks: 1 instances 7928 blocks: 1 instances 8008 blocks: 1 instances 8080 blocks: 1 instances 8176 blocks: 2 instances 8216 blocks: 1 instances 8248 blocks: 1 instances 8336 blocks: 1 instances 8344 blocks: 1 instances 8432 blocks: 1 instances 8520 blocks: 1 instances 8528 blocks: 1 instances 8632 blocks: 1 instances 8712 blocks: 1 instances 8736 blocks: 1 instances 8776 blocks: 1 instances 8816 blocks: 3 instances 8872 blocks: 2 instances 8944 blocks: 1 instances 8992 blocks: 1 instances 9104 blocks: 1 instances 9208 blocks: 1 instances 9448 blocks: 1 instances 9632 blocks: 1 instances 9984 blocks: 1 instances 10152 blocks: 1 instances 10648 blocks: 1 instances 10744 blocks: 1 instances 10896 blocks: 1 instances 10968 blocks: 1 instances 11552 blocks: 1 instances 11800 blocks: 1 instances 11888 blocks: 1 instances 11936 blocks: 1 instances 12184 blocks: 1 instances 12424 blocks: 1 instances 12624 blocks: 1 instances 12728 blocks: 1 instances 12896 blocks: 1 instances 12920 blocks: 1 instances 13120 blocks: 1 instances 13192 blocks: 1 instances 13408 blocks: 1 instances 13672 blocks: 1 instances 13840 blocks: 1 instances 13968 blocks: 1 instances 14480 blocks: 
1 instances 14544 blocks: 1 instances 15168 blocks: 1 instances 15448 blocks: 1 instances 15592 blocks: 1 instances 15704 blocks: 1 instances 15824 blocks: 1 instances 16016 blocks: 1 instances 16240 blocks: 1 instances 16320 blocks: 1 instances 16928 blocks: 1 instances 16976 blocks: 1 instances 17440 blocks: 1 instances 20168 blocks: 1 instances 20552 blocks: 1 instances 20872 blocks: 1 instances 21992 blocks: 1 instances 22016 blocks: 1 instances 23256 blocks: 1 instances 23784 blocks: 1 instances 24248 blocks: 1 instances 24736 blocks: 1 instances 24744 blocks: 1 instances 25096 blocks: 1 instances 25504 blocks: 1 instances 25944 blocks: 1 instances 26112 blocks: 1 instances 26368 blocks: 1 instances 26504 blocks: 1 instances 26768 blocks: 1 instances 28280 blocks: 1 instances 29184 blocks: 1 instances 29592 blocks: 1 instances 29632 blocks: 1 instances 31048 blocks: 1 instances 32440 blocks: 1 instances 34608 blocks: 1 instances 35752 blocks: 1 instances 36080 blocks: 1 instances 36464 blocks: 1 instances 38912 blocks: 1 instances 39752 blocks: 1 instances 40448 blocks: 1 instances 42184 blocks: 1 instances 42616 blocks: 1 instances 43576 blocks: 1 instances 44464 blocks: 1 instances 45656 blocks: 1 instances 48432 blocks: 1 instances 48752 blocks: 1 instances 53144 blocks: 1 instances 54120 blocks: 1 instances 55296 blocks: 1 instances 56584 blocks: 1 instances 59216 blocks: 1 instances 60928 blocks: 1 instances 64000 blocks: 1 instances 64624 blocks: 1 instances 65008 blocks: 1 instances 65024 blocks: 13 instances 65088 blocks: 1 instances 65136 blocks: 1 instances 65168 blocks: 1 instances 65184 blocks: 1 instances 65224 blocks: 1 instances 65232 blocks: 1 instances 65272 blocks: 1 instances 65360 blocks: 1 instances 65368 blocks: 1 instances 65384 blocks: 1 instances 65408 blocks: 1 instances 65416 blocks: 1 instances 65440 blocks: 1 instances 65472 blocks: 1 instances 65488 blocks: 1 instances 65496 blocks: 1 instances 65520 blocks: 1 instances 67072 blocks: 2 instances 68608 blocks: 1 instances 69120 blocks: 1 instances 69456 blocks: 1 instances 69720 blocks: 1 instances 83264 blocks: 1 instances 85272 blocks: 1 instances 94696 blocks: 1 instances 99336 blocks: 1 instances 100208 blocks: 1 instances 100448 blocks: 1 instances 104768 blocks: 1 instances 106992 blocks: 1 instances 112664 blocks: 1 instances 118424 blocks: 1 instances 130048 blocks: 2 instances 130208 blocks: 1 instances 134656 blocks: 1 instances 137568 blocks: 1 instances 156384 blocks: 1 instances 162640 blocks: 1 instances 171728 blocks: 1 instances 189216 blocks: 1 instances 195072 blocks: 1 instances 231424 blocks: 1 instances ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-11 16:31 ` Jan Ziak 2022-03-11 20:02 ` Jan Ziak @ 2022-03-11 23:04 ` Qu Wenruo 2022-03-11 23:28 ` Jan Ziak 1 sibling, 1 reply; 71+ messages in thread From: Qu Wenruo @ 2022-03-11 23:04 UTC (permalink / raw) To: Jan Ziak; +Cc: linux-btrfs On 2022/3/12 00:31, Jan Ziak wrote: > On Fri, Mar 11, 2022 at 6:04 AM Jan Ziak <0xe2.0x9a.0x9b@gmail.com> wrote: >> >> On Fri, Mar 11, 2022 at 3:59 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >>> A few outliers can also be fixed by an upcoming patch: >>> https://patchwork.kernel.org/project/linux-btrfs/patch/d1ce90f37777987732b8ccf0edbfc961cd5c8873.1646912061.git.wqu@suse.com/ >>> >>> But please note that the extra patch won't have as big an impact as the >>> previous one; it's mostly a small optimization. >> >> I will apply and test the patch and report results. > > $ uptime > 10h54m > > CPU time of pid 297: 1h48m > > $ cat /proc/297/io (pid 297 is btrfs-cleaner) > read_bytes: 4_433_081_716_736 > write_bytes: 788_509_859_840 > > file.sqlite, before 10h54m: 1827586 extents > > file.sqlite, after 10h54m: 1876144 extents > > Summary: File fragmentation increased by 48558 extents despite the > fact that btrfs-cleaner read 4.4 TB, wrote 788 GB and consumed 1h48m > of CPU time. > > If it helps, I can send you the complete list of all the 1.8 million > extents. I am not sure how long it might take to obtain such a list. As stated before, autodefrag is not really that useful for databases. So my primary objective here is to make autodefrag cause less CPU/IO in the worst case scenario. BTW, have you compared the number of extents with and without autodefrag? Thanks, Qu > > -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-11 23:04 ` Qu Wenruo @ 2022-03-11 23:28 ` Jan Ziak 2022-03-11 23:39 ` Qu Wenruo 2022-03-12 2:43 ` Zygo Blaxell 0 siblings, 2 replies; 71+ messages in thread From: Jan Ziak @ 2022-03-11 23:28 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On Sat, Mar 12, 2022 at 12:04 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > As stated before, autodefrag is not really that useful for database. Do you realize that you are claiming that btrfs autodefrag should not - by design - be effective in the case of high-fragmentation files? If it isn't supposed to be useful for high-fragmentation files then where is it supposed to be useful? Low-fragmentation files? -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-11 23:28 ` Jan Ziak @ 2022-03-11 23:39 ` Qu Wenruo 2022-03-12 0:01 ` Jan Ziak 2022-03-12 2:43 ` Zygo Blaxell 1 sibling, 1 reply; 71+ messages in thread From: Qu Wenruo @ 2022-03-11 23:39 UTC (permalink / raw) To: Jan Ziak; +Cc: linux-btrfs On 2022/3/12 07:28, Jan Ziak wrote: > On Sat, Mar 12, 2022 at 12:04 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >> As stated before, autodefrag is not really that useful for database. > > Do you realize that you are claiming that btrfs autodefrag should not > - by design - be effective in the case of high-fragmentation files? Unfortunately, that's exactly what I mean. We all know random writes cause fragmentation, but autodefrag is not like the regular defrag ioctl, as it only scans newer extents. For example: Say autodefrag is asked to defrag writes newer than gen 100, and our inode has the following layout: |---Ext A---|--- Ext B---|---Ext C---|---Ext D---|---Ext E---| Gen 50 Gen 101 Gen 49 Gen 30 Gen 30 Then autodefrag will only try to defrag extent B and extent C. Extent B meets the generation requirement, and is mergeable with the next extent C. But all the remaining extents A, D, E will not be defragged, as their generations don't meet the requirement. The regular defrag ioctl, by contrast, has no such generation requirement and is able to defrag all extents from A to E (but causes far more IO). Furthermore, autodefrag works by marking the target range dirty and waiting for writeback (hopefully picking up more writes near it, so the resulting extent can grow even larger). But if the application, like a database, is calling fsync() frequently, such a re-dirtied range is written back almost immediately, without any further chance to be merged into something larger. Thus autodefrag's effectiveness is almost zero for random writes plus frequent fsync(), which is exactly the database workload. > If > it isn't supposed to be useful for high-fragmentation files then where > is it supposed to be useful? Low-fragmentation files? Frequent append writes, or less frequent fsync() calls. Thanks, Qu > > -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
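To make the selection rule Qu describes concrete, here is a small illustrative sketch in C. It is a toy model only: the struct, the function names, the 64K cutoff, and the extent sizes are assumptions made for this example, not the kernel's actual code.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct extent {
    uint64_t gen;   /* transaction generation that wrote this extent */
    uint64_t len;   /* extent length in bytes */
};

/* "New and small": written after the requested generation and below the
 * autodefrag size cutoff. */
static bool is_new_and_small(const struct extent *e, uint64_t newer_than,
                             uint64_t thresh)
{
    return e->gen >= newer_than && e->len < thresh;
}

/* Extent i is re-dirtied if it is itself a new small write (B in the
 * example), or if the new small write immediately before it wants to merge
 * forward into it (C).  Older extents with no new neighbor in front of them
 * (A, D, E) are left alone, no matter how fragmented the file already is. */
static bool autodefrag_target(const struct extent *ext, size_t i,
                              uint64_t newer_than, uint64_t thresh)
{
    if (is_new_and_small(&ext[i], newer_than, thresh))
        return true;
    return i > 0 && is_new_and_small(&ext[i - 1], newer_than, thresh);
}

int main(void)
{
    /* The five-extent layout from the example; only B is newer than gen 100. */
    struct extent ext[] = {
        { 50, 16384 }, { 101, 4096 }, { 49, 16384 }, { 30, 16384 }, { 30, 16384 },
    };
    const char *names = "ABCDE";

    for (size_t i = 0; i < 5; i++)
        printf("extent %c: %s\n", names[i],
               autodefrag_target(ext, i, 100, 65536) ? "defrag" : "skip");
    return 0;
}

Running it prints "defrag" only for B and C, matching the description above.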
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-11 23:39 ` Qu Wenruo @ 2022-03-12 0:01 ` Jan Ziak 2022-03-12 0:15 ` Qu Wenruo 2022-03-12 3:16 ` Zygo Blaxell 0 siblings, 2 replies; 71+ messages in thread From: Jan Ziak @ 2022-03-12 0:01 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On Sat, Mar 12, 2022 at 12:39 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > On 2022/3/12 07:28, Jan Ziak wrote: > > On Sat, Mar 12, 2022 at 12:04 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > >> As stated before, autodefrag is not really that useful for database. > > > > Do you realize that you are claiming that btrfs autodefrag should not > > - by design - be effective in the case of high-fragmentation files? > > Unfortunately, that's exactly what I mean. > > We all know random writes would cause fragments, but autodefrag is not > like regular defrag ioctl, as it only scan newer extents. > > For example: > > Our autodefrag is required to defrag writes newer than gen 100, and our > inode has the following layout: > > |---Ext A---|--- Ext B---|---Ext C---|---Ext D---|---Ext E---| > Gen 50 Gen 101 Gen 49 Gen 30 Gen 30 > > Then autodefrag will only try to defrag extent B and extent C. > > Extent B meets the generation requirement, and is mergable with the next > extent C. > > But all the remaining extents A, D, E will not be defragged as their > generations don't meet the requirement. > > While for regular defrag ioctl, we don't have such generation > requirement, and is able to defrag all extents from A to E. > (But cause way more IO). > > Furthermore, autodefrag works by marking the target range dirty, and > wait for writeback (and hopefully get more writes near it, so it can get > even larger) > > But if the application, like the database, is calling fsync() > frequently, such re-dirtied range is going to writeback almost > immediately, without any further chance to get merged larger. So, basically, what you are saying is that you are refusing to work together towards fixing/improving the auto-defragmentation algorithm. Based on your decision in this matter, I am now forced either to find a replacement filesystem with features similar to btrfs or to implement a filesystem (where auto-defragmentation works correctly) myself. Since I failed to persuade you that there are serious errors/mistakes in the current btrfs-autodefrag implementation, this is my last email in this whole forum thread. Sincerely Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-12 0:01 ` Jan Ziak @ 2022-03-12 0:15 ` Qu Wenruo 2022-03-12 3:16 ` Zygo Blaxell 1 sibling, 0 replies; 71+ messages in thread From: Qu Wenruo @ 2022-03-12 0:15 UTC (permalink / raw) To: Jan Ziak; +Cc: linux-btrfs On 2022/3/12 08:01, Jan Ziak wrote: > On Sat, Mar 12, 2022 at 12:39 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >> On 2022/3/12 07:28, Jan Ziak wrote: >>> On Sat, Mar 12, 2022 at 12:04 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >>>> As stated before, autodefrag is not really that useful for database. >>> >>> Do you realize that you are claiming that btrfs autodefrag should not >>> - by design - be effective in the case of high-fragmentation files? >> >> Unfortunately, that's exactly what I mean. >> >> We all know random writes would cause fragments, but autodefrag is not >> like regular defrag ioctl, as it only scan newer extents. >> >> For example: >> >> Our autodefrag is required to defrag writes newer than gen 100, and our >> inode has the following layout: >> >> |---Ext A---|--- Ext B---|---Ext C---|---Ext D---|---Ext E---| >> Gen 50 Gen 101 Gen 49 Gen 30 Gen 30 >> >> Then autodefrag will only try to defrag extent B and extent C. >> >> Extent B meets the generation requirement, and is mergable with the next >> extent C. >> >> But all the remaining extents A, D, E will not be defragged as their >> generations don't meet the requirement. >> >> While for regular defrag ioctl, we don't have such generation >> requirement, and is able to defrag all extents from A to E. >> (But cause way more IO). >> >> Furthermore, autodefrag works by marking the target range dirty, and >> wait for writeback (and hopefully get more writes near it, so it can get >> even larger) >> >> But if the application, like the database, is calling fsync() >> frequently, such re-dirtied range is going to writeback almost >> immediately, without any further chance to get merged larger. > > So, basically, what you are saying is that you are refusing to work > together towards fixing/improving the auto-defragmentation algorithm. I'm explaining how autodefrag works, and work to improve autodefrag to handle the worst case scenario. If it doesn't fit your workload, that's unfortunate. There are always cases btrfs can't handle well. > > Based on your decision in this matter, I am now forced either to find > a replacement filesystem with features similar to btrfs or to > implement a filesystem (where auto-defragmentation works correctly) > myself. > > Since I failed to persuade you that there are serious errors/mistakes > in the current btrfs-autodefrag implementation, this is my last email > in this whole forum thread. > > Sincerely > Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-12 0:01 ` Jan Ziak 2022-03-12 0:15 ` Qu Wenruo @ 2022-03-12 3:16 ` Zygo Blaxell 1 sibling, 0 replies; 71+ messages in thread From: Zygo Blaxell @ 2022-03-12 3:16 UTC (permalink / raw) To: Jan Ziak; +Cc: Qu Wenruo, linux-btrfs On Sat, Mar 12, 2022 at 01:01:36AM +0100, Jan Ziak wrote: > On Sat, Mar 12, 2022 at 12:39 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > On 2022/3/12 07:28, Jan Ziak wrote: > > > On Sat, Mar 12, 2022 at 12:04 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > >> As stated before, autodefrag is not really that useful for database. > > > > > > Do you realize that you are claiming that btrfs autodefrag should not > > > - by design - be effective in the case of high-fragmentation files? > > > > Unfortunately, that's exactly what I mean. > > > > We all know random writes would cause fragments, but autodefrag is not > > like regular defrag ioctl, as it only scan newer extents. > > > > For example: > > > > Our autodefrag is required to defrag writes newer than gen 100, and our > > inode has the following layout: > > > > |---Ext A---|--- Ext B---|---Ext C---|---Ext D---|---Ext E---| > > Gen 50 Gen 101 Gen 49 Gen 30 Gen 30 > > > > Then autodefrag will only try to defrag extent B and extent C. > > > > Extent B meets the generation requirement, and is mergable with the next > > extent C. > > > > But all the remaining extents A, D, E will not be defragged as their > > generations don't meet the requirement. > > > > While for regular defrag ioctl, we don't have such generation > > requirement, and is able to defrag all extents from A to E. > > (But cause way more IO). > > > > Furthermore, autodefrag works by marking the target range dirty, and > > wait for writeback (and hopefully get more writes near it, so it can get > > even larger) > > > > But if the application, like the database, is calling fsync() > > frequently, such re-dirtied range is going to writeback almost > > immediately, without any further chance to get merged larger. > > So, basically, what you are saying is that you are refusing to work > together towards fixing/improving the auto-defragmentation algorithm. > > Based on your decision in this matter, I am now forced either to find > a replacement filesystem with features similar to btrfs or to > implement a filesystem (where auto-defragmentation works correctly) > myself. The second of those options is the TL;DR of my previous email, and you don't need to rewrite any part of btrfs except the autodefrag feature. I can answer questions to get you started. You will need to read up on: TREE_SEARCH_V2, the search ioctl. This gives you fast access to new extent refs. You'll need to decode them. The code in btrfs-progs for printing tree items is very useful to see how this is done. INO_PATHS, the resolve-inode-to-path-name ioctl. TREE_SEARCH_V2 will give you inode numbers, but DEFRAG_RANGE needs an open fd. This ioctl is the bridge between them. DEFRAG_RANGE, the defrag ioctl. This defrags a range of a file. The simple daemon model is: - track the filesystem transid every 30 seconds, sleep until it changes - use the TREE_SEARCH_V2 ioctl to find new extent references since the previous transid. See the 'btrfs sub find-new' implementation for details on extracting extent references and filtering by age. This has to be run on every subvol individually, but you can have a daemon for every subvol, or one process that runs this loops over all subvols. 
- examine extent references to see if they are good candidates for defrag: not too large or too small, no holes between, etc. This is a replica of the existing kernel algorithm. You can improve on this immediately by running new searches for neighboring extents within optimal defrag range without the transid filter. - ignore bad extent candidates - use INO_PATHS to retrieve the filenames of the inode containing the extent. You can improve on this by filtering filenames of files that are known to have extremely high update rates, or any other criteria that seem useful. - open the file using one of the names, and issue DEFRAG_RANGE to defragment the extents (a minimal C sketch of this call follows this message). If you store the last transid persistently (say in a /var file), you can run one iteration of the loop periodically during periods of low sensitivity to IO latency. It doesn't need to run continuously; you can start and stop it at any time depending on need. There are a few gotchas. The main one is that there's an upper bound on optimal extent size in btrfs, as well as a lower bound. Extents that are too large waste space because they cannot be deallocated until the last reference to the last block is overwritten or deleted. So you probably want to stop defragmenting once the extents are 256K or so on a database file, or it will waste a lot of space. Use lower values for heavily active files with random writes, higher values for infrequently modified files. Maximum extent size is 128K for a compressed extent, 128M for uncompressed. > Since I failed to persuade you that there are serious errors/mistakes > in the current btrfs-autodefrag implementation, this is my last email > in this whole forum thread. > Sincerely > Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
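A minimal sketch of the DEFRAG_RANGE step above, assuming a regular btrfs file the caller has write permission on. The struct and ioctl come from the linux/btrfs.h UAPI header; the 256K extent_thresh follows the gotcha above, and error handling is reduced to perror().

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

/* Defragment [start, start + len) of an open btrfs file, leaving alone any
 * extent that is already at least extent_thresh bytes long. */
static int defrag_range(int fd, __u64 start, __u64 len, __u32 extent_thresh)
{
    struct btrfs_ioctl_defrag_range_args args;

    memset(&args, 0, sizeof(args));
    args.start = start;
    args.len = len;
    args.extent_thresh = extent_thresh;
    /* args.flags could also request recompression (BTRFS_DEFRAG_RANGE_COMPRESS)
     * or immediate writeback (BTRFS_DEFRAG_RANGE_START_IO). */
    return ioctl(fd, BTRFS_IOC_DEFRAG_RANGE, &args);
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file-on-btrfs>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDWR);   /* needs write permission on the file */
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Whole file here; a daemon would pass the candidate range it found. */
    if (defrag_range(fd, 0, (__u64)-1, 256 * 1024) < 0) {
        perror("BTRFS_IOC_DEFRAG_RANGE");
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}

A daemon built around the loop above would call defrag_range() with the byte range of the candidate extents found by the tree search, instead of covering the whole file.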
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-11 23:28 ` Jan Ziak 2022-03-11 23:39 ` Qu Wenruo @ 2022-03-12 2:43 ` Zygo Blaxell 2022-03-12 3:24 ` Qu Wenruo 1 sibling, 1 reply; 71+ messages in thread From: Zygo Blaxell @ 2022-03-12 2:43 UTC (permalink / raw) To: Jan Ziak; +Cc: Qu Wenruo, linux-btrfs On Sat, Mar 12, 2022 at 12:28:10AM +0100, Jan Ziak wrote: > On Sat, Mar 12, 2022 at 12:04 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > As stated before, autodefrag is not really that useful for database. > > Do you realize that you are claiming that btrfs autodefrag should not > - by design - be effective in the case of high-fragmentation files? If > it isn't supposed to be useful for high-fragmentation files then where > is it supposed to be useful? Low-fragmentation files? IMHO it's best to deprecate the in-kernel autodefrag option, and start over with a better approach. The kernel is the wrong place to solve this problem, and the undesirable and unfixable things in autodefrag are a consequence of that early design error. As far as I can tell, in-kernel autodefrag's only purpose is to provide exposure to new and exciting bugs on each kernel release, and a lot of uncontrolled IO demands even when it's working perfectly. Inevitably, re-reading old fragments that are no longer in memory will consume RAM and iops during writeback activity, when memory and IO bandwidth is least available. If we avoid expensive re-reading of extents, then we don't get a useful rate of reduction of fragmentation, because we can't coalesce small new exists with small existing ones. If we try to fix these issues one at a time, the feature would inevitably grow a lot of complicated and brittle configuration knobs to turn it off selectively, because it's so awful without extensive filtering. All the above criticism applies to abstract ideal in-kernel autodefrag, _before_ considering whether a concrete implementation might have limitations or bugs which make it worse than the already-bad best case. 5.16 happened to have a lot of examples of these, but fixing the regressions can only restore autodefrag's relative harmlessness, not add utility within the constraints the kernel is under. The right place to do autodefrag is userspace. Interfaces already exist for userspace to 1) discover new extents and their neighbors, quickly and safely, across the entire filesystem; 2) invoke defrag_range on file extent ranges found in step 1; and 3) run a while (true) loop that periodically performs steps 1 and 2. Indeed, the existing kernel autodefrag implementation is already using the same back-end infrastructure for parts 1 and 2, so all that would be required for userspace is to reimplement (and start improving upon) part 3. A command-line utility or daemon can locate new extents immediately with tree_search queries, either at filesystem-wide scales, or directed at user-chosen file subsets. Tools can quickly assess whether new extents are good candidates for defrag, then coalesce them with their neighbors. The user can choose between different tools to decide basic policy questions like: whether to run once in a batch job or continuously in the background, what amounts of IO bandwidth and memory to consume, whether to recompress data with a more aggressive algorithm/level, which reference to a snapshot-shared extent should be preferred for defrag, file-type-specific layout optimizations to apply, or any custom or experimental selection, scheduling, or optimization logic desired. 
Implementations can be kept simple because it's not necessary for userspace tools to pile every possible option into a single implementation, and support every released option forever (as required for the kernel). A specialist implementation can discard existing code with impunity or start from scratch with an experimental algorithm, and spend its life in a fork of the main userspace autodefrag project with niche users who never have to cope with generic users' use cases and vice versa. This efficiently distributes development and maintenance costs. Userspace autodefrag can be implemented today in any programming language with btrfs ioctl support, and run on any kernel released in the last 6 years. Alas, I don't know of anybody who's released a userspace autodefrag tool yet, and it hasn't been important enough to me to build one myself (other than a few proof-of-concept prototypes). For now, I do defrag mostly ad-hoc with 'btrfs fi defrag' on the most severely fragmented files (top N list of files with the highest extent counts on the filesystem), and ignore fragmentation everywhere else. > -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
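For the ad-hoc "top N most fragmented files" approach mentioned at the end of the previous message, per-file extent counts can be gathered with the same FIEMAP ioctl that filefrag uses. A hedged sketch that only counts extents (finding and sorting the files is left to the caller):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>

/* Return the number of extents backing the file, or -1 on error.  With
 * fm_extent_count == 0, FIEMAP copies no extent records back to userspace
 * and only reports how many extents the mapping would need. */
static long count_extents(const char *path)
{
    struct fiemap fm = {
        .fm_start = 0,
        .fm_length = FIEMAP_MAX_OFFSET,   /* whole file */
        .fm_extent_count = 0,             /* just count */
    };
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) {
        close(fd);
        return -1;
    }
    close(fd);
    return (long)fm.fm_mapped_extents;
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++)
        printf("%ld\t%s\n", count_extents(argv[i]), argv[i]);
    return 0;
}

As the filefrag run at the start of this thread shows, even just enumerating extents can take a long time on a file with millions of them, so a practical tool would want to rate-limit these scans.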
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-12 2:43 ` Zygo Blaxell @ 2022-03-12 3:24 ` Qu Wenruo 2022-03-12 3:48 ` Zygo Blaxell 0 siblings, 1 reply; 71+ messages in thread From: Qu Wenruo @ 2022-03-12 3:24 UTC (permalink / raw) To: Zygo Blaxell, Jan Ziak; +Cc: linux-btrfs On 2022/3/12 10:43, Zygo Blaxell wrote: > On Sat, Mar 12, 2022 at 12:28:10AM +0100, Jan Ziak wrote: >> On Sat, Mar 12, 2022 at 12:04 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >>> As stated before, autodefrag is not really that useful for database. >> >> Do you realize that you are claiming that btrfs autodefrag should not >> - by design - be effective in the case of high-fragmentation files? If >> it isn't supposed to be useful for high-fragmentation files then where >> is it supposed to be useful? Low-fragmentation files? > > IMHO it's best to deprecate the in-kernel autodefrag option, and start > over with a better approach. The kernel is the wrong place to solve > this problem, and the undesirable and unfixable things in autodefrag > are a consequence of that early design error. I'm having the same feeling exactly. Especially the current autodefrag is putting its own policy (transid filter) without providing a mechanism to utilize from user space. Exactly the opposite what we should do, provide a mechanism not a policy. Not to mention there are quite some limitations of the current policy. But unfortunately, even we deprecate it right now, it will takes a long time to really remove it from kernel. While on the other hand, we also need to introduce new parameters like @newer_than, and @max_to_defrag to the ioctl interface. Which may already eat up the unused bytes (only 16 bytes, while newer_than needs u64, max_to_defrag may also need to be u64). And user space tool lacks one of the critical info, where the small writes are. So even I can't be more happier to deprecate the autodefrag, we still need to hang on it for a pretty lone time, before a user space tool which can do everything the same as autodefrag. Thanks, Qu > > As far as I can tell, in-kernel autodefrag's only purpose is to provide > exposure to new and exciting bugs on each kernel release, and a lot of > uncontrolled IO demands even when it's working perfectly. Inevitably, > re-reading old fragments that are no longer in memory will consume RAM > and iops during writeback activity, when memory and IO bandwidth is least > available. If we avoid expensive re-reading of extents, then we don't > get a useful rate of reduction of fragmentation, because we can't coalesce > small new exists with small existing ones. If we try to fix these issues > one at a time, the feature would inevitably grow a lot of complicated > and brittle configuration knobs to turn it off selectively, because it's > so awful without extensive filtering. > > All the above criticism applies to abstract ideal in-kernel autodefrag, > _before_ considering whether a concrete implementation might have > limitations or bugs which make it worse than the already-bad best case. > 5.16 happened to have a lot of examples of these, but fixing the > regressions can only restore autodefrag's relative harmlessness, not > add utility within the constraints the kernel is under. > > The right place to do autodefrag is userspace. 
Interfaces already > exist for userspace to 1) discover new extents and their neighbors, > quickly and safely, across the entire filesystem; 2) invoke defrag_range > on file extent ranges found in step 1; and 3) run a while (true) > loop that periodically performs steps 1 and 2. Indeed, the existing > kernel autodefrag implementation is already using the same back-end > infrastructure for parts 1 and 2, so all that would be required for > userspace is to reimplement (and start improving upon) part 3. > > A command-line utility or daemon can locate new extents immediately with > tree_search queries, either at filesystem-wide scales, or directed at > user-chosen file subsets. Tools can quickly assess whether new extents > are good candidates for defrag, then coalesce them with their neighbors. > > The user can choose between different tools to decide basic policy > questions like: whether to run once in a batch job or continuously in > the background, what amounts of IO bandwidth and memory to consume, > whether to recompress data with a more aggressive algorithm/level, which > reference to a snapshot-shared extent should be preferred for defrag, > file-type-specific layout optimizations to apply, or any custom or > experimental selection, scheduling, or optimization logic desired. > > Implementations can be kept simple because it's not necessary for > userspace tools to pile every possible option into a single implementation, > and support every released option forever (as required for the kernel). > A specialist implementation can discard existing code with impunity or > start from scratch with an experimental algorithm, and spend its life > in a fork of the main userspace autodefrag project with niche users > who never have to cope with generic users' use cases and vice versa. > This efficiently distributes development and maintenance costs. > > Userspace autodefrag can be implemented today in any programming language > with btrfs ioctl support, and run on any kernel released in the last > 6 years. Alas, I don't know of anybody who's released a userspace > autodefrag tool yet, and it hasn't been important enough to me to build > one myself (other than a few proof-of-concept prototypes). > > For now, I do defrag mostly ad-hoc with 'btrfs fi defrag' on the most > severely fragmented files (top N list of files with the highest extent > counts on the filesystem), and ignore fragmentation everywhere else. > > >> -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-12 3:24 ` Qu Wenruo @ 2022-03-12 3:48 ` Zygo Blaxell 0 siblings, 0 replies; 71+ messages in thread From: Zygo Blaxell @ 2022-03-12 3:48 UTC (permalink / raw) To: Qu Wenruo; +Cc: Jan Ziak, linux-btrfs On Sat, Mar 12, 2022 at 11:24:18AM +0800, Qu Wenruo wrote: > > > On 2022/3/12 10:43, Zygo Blaxell wrote: > > On Sat, Mar 12, 2022 at 12:28:10AM +0100, Jan Ziak wrote: > > > On Sat, Mar 12, 2022 at 12:04 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > > > As stated before, autodefrag is not really that useful for database. > > > > > > Do you realize that you are claiming that btrfs autodefrag should not > > > - by design - be effective in the case of high-fragmentation files? If > > > it isn't supposed to be useful for high-fragmentation files then where > > > is it supposed to be useful? Low-fragmentation files? > > > > IMHO it's best to deprecate the in-kernel autodefrag option, and start > > over with a better approach. The kernel is the wrong place to solve > > this problem, and the undesirable and unfixable things in autodefrag > > are a consequence of that early design error. > > I'm having the same feeling exactly. > > Especially the current autodefrag is putting its own policy (transid > filter) without providing a mechanism to utilize from user space. > > Exactly the opposite what we should do, provide a mechanism not a policy. > > Not to mention there are quite some limitations of the current policy. > > > But unfortunately, even we deprecate it right now, it will takes a long > time to really remove it from kernel. Agree that we have to keep it around until everyone has moved over to the new thing; however, we can stop developing the old thing much sooner, and work on the new thing immediately. > While on the other hand, we also need to introduce new parameters like > @newer_than, and @max_to_defrag to the ioctl interface. > > Which may already eat up the unused bytes (only 16 bytes, while > newer_than needs u64, max_to_defrag may also need to be u64). > > And user space tool lacks one of the critical info, where the small > writes are. Userspace can find new extents pretty fast if it's keeping up with writes in real time. bees scans do a search for all new extents in the last 30 seconds (not just the small ones) and finish in tenths of milliseconds with a hot cache. This is orders of magnitude faster than the actual defragmentation, which has to do all the data IO twice, copy all the modified metadata pages, delayed extent refs, and pay the seek costs for re-reading the fragmented data and writing it somewhere else. The kernel could maintain the list of autodefrag inodes and simply provide them to userspace on demand, but honestly I don't think this list is worth even the tiny amount of memory that it uses. > So even I can't be more happier to deprecate the autodefrag, we still > need to hang on it for a pretty lone time, before a user space tool > which can do everything the same as autodefrag. > > Thanks, > Qu > > > > > As far as I can tell, in-kernel autodefrag's only purpose is to provide > > exposure to new and exciting bugs on each kernel release, and a lot of > > uncontrolled IO demands even when it's working perfectly. Inevitably, > > re-reading old fragments that are no longer in memory will consume RAM > > and iops during writeback activity, when memory and IO bandwidth is least > > available. 
If we avoid expensive re-reading of extents, then we don't > > get a useful rate of reduction of fragmentation, because we can't coalesce > > small new exists with small existing ones. If we try to fix these issues > > one at a time, the feature would inevitably grow a lot of complicated > > and brittle configuration knobs to turn it off selectively, because it's > > so awful without extensive filtering. > > > > All the above criticism applies to abstract ideal in-kernel autodefrag, > > _before_ considering whether a concrete implementation might have > > limitations or bugs which make it worse than the already-bad best case. > > 5.16 happened to have a lot of examples of these, but fixing the > > regressions can only restore autodefrag's relative harmlessness, not > > add utility within the constraints the kernel is under. > > > > The right place to do autodefrag is userspace. Interfaces already > > exist for userspace to 1) discover new extents and their neighbors, > > quickly and safely, across the entire filesystem; 2) invoke defrag_range > > on file extent ranges found in step 1; and 3) run a while (true) > > loop that periodically performs steps 1 and 2. Indeed, the existing > > kernel autodefrag implementation is already using the same back-end > > infrastructure for parts 1 and 2, so all that would be required for > > userspace is to reimplement (and start improving upon) part 3. > > > > A command-line utility or daemon can locate new extents immediately with > > tree_search queries, either at filesystem-wide scales, or directed at > > user-chosen file subsets. Tools can quickly assess whether new extents > > are good candidates for defrag, then coalesce them with their neighbors. > > > > The user can choose between different tools to decide basic policy > > questions like: whether to run once in a batch job or continuously in > > the background, what amounts of IO bandwidth and memory to consume, > > whether to recompress data with a more aggressive algorithm/level, which > > reference to a snapshot-shared extent should be preferred for defrag, > > file-type-specific layout optimizations to apply, or any custom or > > experimental selection, scheduling, or optimization logic desired. > > > > Implementations can be kept simple because it's not necessary for > > userspace tools to pile every possible option into a single implementation, > > and support every released option forever (as required for the kernel). > > A specialist implementation can discard existing code with impunity or > > start from scratch with an experimental algorithm, and spend its life > > in a fork of the main userspace autodefrag project with niche users > > who never have to cope with generic users' use cases and vice versa. > > This efficiently distributes development and maintenance costs. > > > > Userspace autodefrag can be implemented today in any programming language > > with btrfs ioctl support, and run on any kernel released in the last > > 6 years. Alas, I don't know of anybody who's released a userspace > > autodefrag tool yet, and it hasn't been important enough to me to build > > one myself (other than a few proof-of-concept prototypes). > > > > For now, I do defrag mostly ad-hoc with 'btrfs fi defrag' on the most > > severely fragmented files (top N list of files with the highest extent > > counts on the filesystem), and ignore fragmentation everywhere else. > > > > > > > -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-11 2:59 ` Qu Wenruo 2022-03-11 5:04 ` Jan Ziak @ 2022-03-14 20:09 ` Phillip Susi 2022-03-14 22:59 ` Zygo Blaxell 1 sibling, 1 reply; 71+ messages in thread From: Phillip Susi @ 2022-03-14 20:09 UTC (permalink / raw) To: Qu Wenruo; +Cc: Jan Ziak, linux-btrfs Qu Wenruo <quwenruo.btrfs@gmx.com> writes: > That's more or less expected. > > Autodefrag has two limitations: > > 1. Only defrag newer writes > It doesn't defrag older fragments. > This is the existing behavior from the beginning of autodefrag. > Thus it's not that effective against small random writes. I don't understand this bit. The whole point of defrag is to reduce the fragmentation of previous writes. New writes should always attempt to follow the previous one if possible. If auto defrag only changes the behavior of new writes, then how does it change it and why is that not the way new writes are always done? ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-14 20:09 ` Phillip Susi @ 2022-03-14 22:59 ` Zygo Blaxell 2022-03-15 18:28 ` Phillip Susi 2022-03-20 17:50 ` Forza 0 siblings, 2 replies; 71+ messages in thread From: Zygo Blaxell @ 2022-03-14 22:59 UTC (permalink / raw) To: Phillip Susi; +Cc: Qu Wenruo, Jan Ziak, linux-btrfs On Mon, Mar 14, 2022 at 04:09:08PM -0400, Phillip Susi wrote: > > Qu Wenruo <quwenruo.btrfs@gmx.com> writes: > > > That's more or less expected. > > > > Autodefrag has two limitations: > > > > 1. Only defrag newer writes > > It doesn't defrag older fragments. > > This is the existing behavior from the beginning of autodefrag. > > Thus it's not that effective against small random writes. > > I don't understand this bit. The whole point of defrag is to reduce the > fragmentation of previous writes. New writes should always attempt to > follow the previous one if possible. New writes are allocated to the first available free space hole large enough to hold them, starting from the point of the last write (plus some other details like clustering and alignment). The goal is that data writes from memory are sequential as much as possible, even if many different files were written in the same transaction. btrfs extents are immutable, so the filesystem can't extend an existing extent with new data. Instead, a new extent must be created that contains both the old and new data to replace the old extent. At least one new fragment must be created whenever the filesystem is modified. (In zoned mode, this is strictly enforced by the underlying hardware.) > If auto defrag only changes the > behavior of new writes, then how does it change it and why is that not > the way new writes are always done? Autodefrag doesn't change write behavior directly. It is a post-processing thread that rereads and rewrites recently written data, _after_ it was originally written to disk. In theory, running defrag after the writes means that the writes can be fast for low latency--they are a physically sequential stream of blocks sent to the disk as fast as it can write them, because btrfs does not have to be concerned with trying to achieve physical contiguity of logically discontiguous data. Later on, when latency is no longer an issue and some IO bandwidth is available, the fragments can be reread and collected together into larger logically and physically contiguous extents by a background process. In practice, autodefrag does only part of that task, badly. Say we have a program that writes 4K to the end of a file, every 5 seconds, for 5 minutes. Every 30 seconds (default commit interval), kernel writeback submits all the dirty pages for writing to btrfs, and in 30 seconds there will be 6 x 4K = 24K of those. An extent in btrfs is created to hold the pages, filled with the data blocks, connected to the various filesystem trees, and flushed out to disk. Over 5 minutes this will happen 10 times, so the file contains 10 fragments, each about 24K (commits are asynchronous, so it might be 20K in one fragment and 28K in the next). After each commit, inodes with new extents are appended to a list in memory. Each list entry contains an inode, a transid of the commit where the first write occurred, and the last defrag offset. That list is processed by a kernel thread some time after the commits are written to disk. The thread searches the inodes for extents created after the last defrag transid, invokes defrag_range on each of these, and advances the offset. 
If the search offset reaches the end of file, then it is reset to the beginning and another loop is done, and if the next search loop over the file doesn't find new extents then the inode is removed from the defrag list. If there's a 5 minute delay between the original writes and autodefrag finally catching up, then autodefrag will detect 10 new extents and run defrag_range over them. This is a read-then-write operation, since the extent blocks may no longer be present in memory after writeback, so autodefrag can easily fall behind writes if there are a lot of them. Also the 64K size limit kicks in, so it might write 5 extents (2 x 24K = 48K, but 3 x 24K = 72K, and autodefrag cuts off at 64K). If there's a 1 minute delay between the original writes and autodefrag, then autodefrag will detect 1 new extents and run defrag over them for a total of 5 new extents, about 240K each. If there's no delay at all, then there will be 10 extents of 120K each--if autodefrag runs immediately after commit, it will see only one extent in each loop, and issue no defrag_range calls. Seen from the point of view of the disk, there are always at least 10x 120K writes. In the no-autodefrag case it ends there. In the autodefrag cases, some of the data is read and rewritten later to make larger extents. In non-appending cases, the kernel autodefrag doesn't do very much useful at all--random writes aren't logically contiguous, so autodefrag never sees two adjacent extents in a search result, and therefore never sees an opportunity to defrag anything. At the time autodefrag was added to the kernel (May 2011), it was already possible to do a better job in userspace for over a year (Feb 2010). Between 2012 and 2021 there are only a handful of bug fixes, mostly of the form "stop autodefrag from ruining things for the rest of the kernel." ^ permalink raw reply [flat|nested] 71+ messages in thread
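A purely illustrative toy model of the bookkeeping described above. The names are invented, and the real kernel processes batches of extents per pass rather than one, but it shows the resume offset, the wrap at EOF, and the removal of an inode once a full pass finds nothing new:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Invented stand-in for the per-inode list entry: the transid cutoff
 * recorded when the inode was queued, and where the last pass stopped. */
struct defrag_entry {
    uint64_t newer_than;
    uint64_t offset;
};

struct extent {
    uint64_t offset, len, gen;
};

/* One simplified pass: find the next extent at or after the resume offset
 * that is newer than the cutoff, "defrag" it, and advance.  Wrap to offset 0
 * at EOF; a full pass that finds nothing means the entry can be dropped. */
static bool one_pass(struct defrag_entry *de, struct extent *ext, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (ext[i].offset < de->offset || ext[i].gen <= de->newer_than)
            continue;
        printf("defrag_range(off=%llu, len=%llu)\n",
               (unsigned long long)ext[i].offset,
               (unsigned long long)ext[i].len);
        ext[i].gen = de->newer_than;   /* toy stand-in for "rewritten, no longer new" */
        de->offset = ext[i].offset + ext[i].len;
        return true;                   /* found work: keep the entry */
    }
    if (de->offset == 0)
        return false;                  /* full pass found nothing: drop the entry */
    de->offset = 0;                    /* reached EOF: wrap and check once more */
    return true;
}

int main(void)
{
    /* Ten 24K appends, as in the example above, all written after transid 100. */
    struct extent ext[10];
    for (size_t i = 0; i < 10; i++)
        ext[i] = (struct extent){ .offset = i * 24576, .len = 24576, .gen = 101 + i };

    struct defrag_entry de = { .newer_than = 100, .offset = 0 };
    while (one_pass(&de, ext, 10))
        ;   /* runs until a full pass finds no new extents */
    return 0;
}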
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-14 22:59 ` Zygo Blaxell @ 2022-03-15 18:28 ` Phillip Susi 2022-03-15 19:28 ` Jan Ziak 2022-03-15 21:06 ` Zygo Blaxell 2022-03-20 17:50 ` Forza 1 sibling, 2 replies; 71+ messages in thread From: Phillip Susi @ 2022-03-15 18:28 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Qu Wenruo, Jan Ziak, linux-btrfs Zygo Blaxell <ce3g8jdj@umail.furryterror.org> writes: > btrfs extents are immutable, so the filesystem can't extend an existing > extent with new data. Instead, a new extent must be created that contains > both the old and new data to replace the old extent. At least one new Wait, what? How is an extent immutable? Why isn't a new tree written out with a larger extent and once the transaction commits, bam... you've enlarged your extent? Just like modifying any other data. And do you mean to say that before the new data can be written, the old data must first be read in and moved to the new extent? That seems horridly inefficient. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-15 18:28 ` Phillip Susi @ 2022-03-15 19:28 ` Jan Ziak 0 siblings, 0 replies; 71+ messages in thread From: Jan Ziak @ 2022-03-15 19:28 UTC (permalink / raw) To: Phillip Susi; +Cc: Zygo Blaxell, Qu Wenruo, linux-btrfs On Tue, Mar 15, 2022 at 7:34 PM Phillip Susi <phill@thesusis.net> wrote: > Zygo Blaxell <ce3g8jdj@umail.furryterror.org> writes: > > > btrfs extents are immutable, so the filesystem can't extend an existing > > extent with new data. Instead, a new extent must be created that contains > > both the old and new data to replace the old extent. At least one new > > Wait, what? How is an extent immutable? Why isn't a new tree written > out with a larger extent and once the transaction commits, bam... you've > enlarged your extent? Just like modifying any other data. > > And do you mean to say that before the new data can be written, the old > data must first be read in and moved to the new extent? That seems > horridly inefficient. I think one way to make sense of this is that, in btrfs, not only is past file data immutable (because it is a CoW filesystem), but certain parts of the filesystem's metadata (such as extents) are immutable as well. Modifying metadata belonging to a previous (and thus, by design, unmodifiable) generation in the btrfs filesystem is somewhat complicated. -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-15 18:28 ` Phillip Susi 2022-03-15 19:28 ` Jan Ziak @ 2022-03-15 21:06 ` Zygo Blaxell 2022-03-15 22:20 ` Jan Ziak 2022-03-16 18:46 ` Phillip Susi 1 sibling, 2 replies; 71+ messages in thread From: Zygo Blaxell @ 2022-03-15 21:06 UTC (permalink / raw) To: Phillip Susi; +Cc: Qu Wenruo, Jan Ziak, linux-btrfs On Tue, Mar 15, 2022 at 02:28jjjZ:46PM -0400, Phillip Susi wrote: > Zygo Blaxell <ce3g8jdj@umail.furryterror.org> writes: > > > btrfs extents are immutable, so the filesystem can't extend an existing > > extent with new data. Instead, a new extent must be created that contains > > both the old and new data to replace the old extent. At least one new > > Wait, what? How is an extent immutable? Why isn't a new tree written > out with a larger extent and once the transaction commits, bam... you've > enlarged your extent? Just like modifying any other data. If the extent is compressed, you have to write a new extent, because there's no other way to atomically update a compressed extent. If it's reflinked or snapshotted, you can't overwrite the data in place as long as a second reference to the data exists. This is what makes nodatacow and prealloc slow--on every write, they have to check whether the blocks being written are shared or not, and that check is expensive because it's a linear search of every reference for overlapping block ranges, and it can't exit the search early until it has proven there are no shared references. Contrast with datacow, which allocates a new unshared extent that it knows it can write to, and only has to check overwritten extents when they are completely overwritten (and only has to check for the existence of one reference, not enumerate them all). When a file refers to an extent, it refers to the entire extent from the file's subvol tree, even if only a single byte of the extent is contained in the file. There's no mechanism in btrfs extent tree v1 for atomically replacing an extent with separately referenceable objects, and updating all the pointers to parts of the old object to point to the new one. Any such update could cascade into updates across all reflinks and snapshots of the extent, so the write multiplier can be arbitrarily large. There is an extent tree v2 project which provides for splitting uncompressed extents (compressed extents are always immutable) by storing all the overlapping references as objects in the extent tree. It does reference tracking by creating an extent item for every referenced block range, so changing one reference's position or length (e.g. by overwriting or deleting part of an extent reference in a file) doesn't affect any other reference. In theory it could also append to the end of an existing extent, if that case ever came up. That brings us to the next problem: mutable extents won't help with the appending case without also teaching the allocator how to spread out files all over the disk so there's physical space available at file EOF. Normally in btrfs, if you write to 3 files, whatever you wrote is packed into 3 physically contiguous and adjacent extents. If you then want to append to the first or second file, you'll need a new extent, because there's no physical space between the files. > And do you mean to say that before the new data can be written, the old > data must first be read in and moved to the new extent? That seems > horridly inefficient. Normally btrfs doesn't read anything when it writes. 
New writes create new extents for the new data, and delete only extents that are completely replaced by the new extents. A series of sequential small writes creates a lot of small extents, and small extents are sometimes undesirable. Defrag gathers these small extents when they are logically adjacent, reads them into memory, writes a new physically contiguous extent to replace them, then deletes the old extents. Autodefrag is a process that makes defrag happen soon after the extents are written. Defrag isn't the only way to resolve the small-extents issue. If the file is only read once (e.g. a log file that is rotated and compressed with a high-performance compressor like xz) then defrag is a waste of read/write cycles--it's better to leave the small fragments where they are until they are deleted by an application. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-15 21:06 ` Zygo Blaxell @ 2022-03-15 22:20 ` Jan Ziak 2022-03-16 17:02 ` Zygo Blaxell 2022-03-16 18:46 ` Phillip Susi 1 sibling, 1 reply; 71+ messages in thread From: Jan Ziak @ 2022-03-15 22:20 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Phillip Susi, Qu Wenruo, linux-btrfs On Tue, Mar 15, 2022 at 10:06 PM Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: > This is what makes > nodatacow and prealloc slow--on every write, they have to check whether > the blocks being written are shared or not, and that check is expensive > because it's a linear search of every reference for overlapping block > ranges, and it can't exit the search early until it has proven there > are no shared references. Contrast with datacow, which allocates a new > unshared extent that it knows it can write to, and only has to check > overwritten extents when they are completely overwritten (and only has > to check for the existence of one reference, not enumerate them all). Some questions: - Linear nodatacow search: Do you mean that write(fd1, buf1, 4096) to a larger nodatacow file is slower compared to write(fd2, buf2, 4096) to a smaller nodatacow file? - Linear nodatacow search: Does the search happen only with uncached metadata, or also with metadata cached in RAM? - Extent tree v2 + nodatacow: V2 also features the linear search (like v1) or has the search been redesigned to be logarithmic? -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-15 22:20 ` Jan Ziak @ 2022-03-16 17:02 ` Zygo Blaxell 2022-03-16 17:48 ` Jan Ziak 0 siblings, 1 reply; 71+ messages in thread From: Zygo Blaxell @ 2022-03-16 17:02 UTC (permalink / raw) To: Jan Ziak; +Cc: Phillip Susi, Qu Wenruo, linux-btrfs On Tue, Mar 15, 2022 at 11:20:09PM +0100, Jan Ziak wrote: > On Tue, Mar 15, 2022 at 10:06 PM Zygo Blaxell > <ce3g8jdj@umail.furryterror.org> wrote: > > This is what makes > > nodatacow and prealloc slow--on every write, they have to check whether > > the blocks being written are shared or not, and that check is expensive > > because it's a linear search of every reference for overlapping block > > ranges, and it can't exit the search early until it has proven there > > are no shared references. Contrast with datacow, which allocates a new > > unshared extent that it knows it can write to, and only has to check > > overwritten extents when they are completely overwritten (and only has > > to check for the existence of one reference, not enumerate them all). > > Some questions: > > - Linear nodatacow search: Do you mean that write(fd1, buf1, 4096) to > a larger nodatacow file is slower compared to write(fd2, buf2, 4096) > to a smaller nodatacow file? Size doesn't matter, the number and position of references do. It's true that large extents tend to end up with higher average reference counts than small extents, but that's only spurious correlation--the "large extent" and "many references" cases are independent. An 8K nodatacow extent, where the first 4K block has exactly one reference and the second 4K has 32767 references, requires a 32768 times more CPU work to write than a 128M extent with a single reference. In sane cases, there's only one reference to a nodatacow/prealloc extent, because multiple references will turn off nodatacow and multiple writes will turn off prealloc, defeating both features. When there's only one reference, the linear search for overlapping blocks ends quickly. In insane cases (after hole punching, snapshots, reflinks, or writes to prealloc files) there exist multiple references to the extent, each covering distinct byte ranges of the extent. The btrfs trees only index references from leaf metadata pages to the entire extent, so to calculate the number of times an individual block is referenced, we have to iterate over every existing reference to see if it happens to overlap the blocks of interest. That's O(N) in the number of references (roughly--e.g. we don't need to examine different snapshots sharing a metadata page, because every snapshot sharing a metadata page references the same bytes in the data extent, but I don't know if btrfs implements that optimization). We can't simply read the reference count on the extent for various reasons. One is that we don't know what the true reference count is without walking all parent tree nodes toward the root to see if there's a snapshot. The extent is referenced by one metadata page, so its reference count is 1, but the metadata page is shared by multiple tree roots, so the true reference count is higher. Another is that a hole punched into the middle of an extent causes two references from the same file, where each reference covers a distinct set of blocks. None of the individual blocks are shared, but the extent's reference count is 2. > - Linear nodatacow search: Does the search happen only with uncached > metadata, or also with metadata cached in RAM? 
All metadata is cached in RAM prior to searching. I think I missed where you were going with this question. > - Extent tree v2 + nodatacow: Does v2 also feature the linear search (like > v1), or has the search been redesigned to be logarithmic? I haven't seen the implementation, but the design implies a linear search over the adjacent range of extent physical addresses that is up to 2 * max_extent_len wide. It could be made faster with a clever data structure, which is implied in the project description, but I haven't seen details. There are simple ways to make nodatacow fast, but btrfs doesn't implement them. e.g. nodatacow could be a subvol property, where reflink and snapshot are prohibited over the entire subvol when nodatacow is enabled. That would eliminate the need to ever search extent references on write--nodatacow writes could safely assume everything in the subvol is never shared--and it would match the expectations of people who prefer that nodatacow takes precedence over all incompatible btrfs features. > -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
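For concreteness, one way a nodatacow extent picks up a second reference in the first place is a reflink copy; a rough sketch (file names and sizes are made up, not from the thread). Once the reflink exists, every overwrite of the shared range has to do the reference walk described above.

touch nocow-test
chattr +C nocow-test                        # nodatacow applies only to data written afterwards
dd if=/dev/zero of=nocow-test bs=1M count=8 conv=notrunc
sync
cp --reflink=always nocow-test nocow-copy   # second reference to the same nodatacow extent
sync
compsize nocow-test nocow-copy              # Referenced ~16M vs Disk Usage ~8M: the extent is now shared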
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-16 17:02 ` Zygo Blaxell @ 2022-03-16 17:48 ` Jan Ziak 2022-03-17 2:11 ` Zygo Blaxell 0 siblings, 1 reply; 71+ messages in thread From: Jan Ziak @ 2022-03-16 17:48 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Phillip Susi, Qu Wenruo, linux-btrfs On Wed, Mar 16, 2022 at 6:02 PM Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: > On Tue, Mar 15, 2022 at 11:20:09PM +0100, Jan Ziak wrote: > > - Linear nodatacow search: Does the search happen only with uncached > > metadata, or also with metadata cached in RAM? > > All metadata is cached in RAM prior to searching. I think I missed > where you were going with this question. The idea behind the question was whether the on-disk format of metadata differs from the in-memory format of metadata; whether metadata is being transformed when loading/saving it from/to the storage device. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-16 17:48 ` Jan Ziak @ 2022-03-17 2:11 ` Zygo Blaxell 0 siblings, 0 replies; 71+ messages in thread From: Zygo Blaxell @ 2022-03-17 2:11 UTC (permalink / raw) To: Jan Ziak; +Cc: Phillip Susi, Qu Wenruo, linux-btrfs On Wed, Mar 16, 2022 at 06:48:04PM +0100, Jan Ziak wrote: > On Wed, Mar 16, 2022 at 6:02 PM Zygo Blaxell > <ce3g8jdj@umail.furryterror.org> wrote: > > On Tue, Mar 15, 2022 at 11:20:09PM +0100, Jan Ziak wrote: > > > - Linear nodatacow search: Does the search happen only with uncached > > > metadata, or also with metadata cached in RAM? > > > > All metadata is cached in RAM prior to searching. I think I missed > > where you were going with this question. > > The idea behind the question was whether the on-disk format of > metadata differs from the in-memory format of metadata; whether > metadata is being transformed when loading/saving it from/to the > storage device. Both things happen. Metadata reference updates are handled by delayed refs, which track pending reference updates (mostly with the hope of eliminating them entirely, as increment/decrement pairs are common). If these don't cancel out by the end of a transaction, they are turned into metadata page updates. Metadata searches use tree mod log, which is an in-memory version of the history of metadata updates so far in the transaction, since the metadata page buffers themselves will be out of date. Anything not in those caches (including everything committed to disk) is in metadata buffers which are memory buffers in on-disk format. There is a backref cache which is used for relocation, but not the nodatacow/prealloc cases (or normal deletes). Caching doesn't really work for the writing cases since the metadata is changing under the cache. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-15 21:06 ` Zygo Blaxell 2022-03-15 22:20 ` Jan Ziak @ 2022-03-16 18:46 ` Phillip Susi 2022-03-16 19:59 ` Zygo Blaxell 1 sibling, 1 reply; 71+ messages in thread From: Phillip Susi @ 2022-03-16 18:46 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Qu Wenruo, Jan Ziak, linux-btrfs Zygo Blaxell <ce3g8jdj@umail.furryterror.org> writes: > If the extent is compressed, you have to write a new extent, because > there's no other way to atomically update a compressed extent. Right, that makes sense for compression. > If it's reflinked or snapshotted, you can't overwrite the data in place > as long as a second reference to the data exists. This is what makes > nodatacow and prealloc slow--on every write, they have to check whether > the blocks being written are shared or not, and that check is expensive > because it's a linear search of every reference for overlapping block > ranges, and it can't exit the search early until it has proven there > are no shared references. Contrast with datacow, which allocates a new > unshared extent that it knows it can write to, and only has to check > overwritten extents when they are completely overwritten (and only has > to check for the existence of one reference, not enumerate them all). Right, I know you can't overwrite the data in place. What I'm not understanding is why you can't just write the new data elsewhere and then free the no longer used portion of the old extent. > When a file refers to an extent, it refers to the entire extent from the > file's subvol tree, even if only a single byte of the extent is contained > in the file. There's no mechanism in btrfs extent tree v1 for atomically > replacing an extent with separately referenceable objects, and updating > all the pointers to parts of the old object to point to the new one. > Any such update could cascade into updates across all reflinks and > snapshots of the extent, so the write multiplier can be arbitrarily large. So the inode in the subvol tree points to an extent in the extent tree, and then the extent points to the space on disk? And only one extent in the extent tree can ever point to a given location on disk? In other words, if file B is a reflink copy of file A, and you update one page in file B, it can't just create 3 new extents in the extent tree: one that refers to the first part of the original extent, one that refers to the last part of the original extent, and one for the new location of the new data? Instead file B refers to the original extent, and to one new extent, in such a way that the second supersedes part of the first only for file B? ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-16 18:46 ` Phillip Susi @ 2022-03-16 19:59 ` Zygo Blaxell 0 siblings, 0 replies; 71+ messages in thread From: Zygo Blaxell @ 2022-03-16 19:59 UTC (permalink / raw) To: Phillip Susi; +Cc: Qu Wenruo, Jan Ziak, linux-btrfs On Wed, Mar 16, 2022 at 02:46:33PM -0400, Phillip Susi wrote: > > Zygo Blaxell <ce3g8jdj@umail.furryterror.org> writes: > > > If the extent is compressed, you have to write a new extent, because > > there's no other way to atomically update a compressed extent. > > Right, that makes sense for compression. > > > If it's reflinked or snapshotted, you can't overwrite the data in place > > as long as a second reference to the data exists. This is what makes > > nodatacow and prealloc slow--on every write, they have to check whether > > the blocks being written are shared or not, and that check is expensive > > because it's a linear search of every reference for overlapping block > > ranges, and it can't exit the search early until it has proven there > > are no shared references. Contrast with datacow, which allocates a new > > unshared extent that it knows it can write to, and only has to check > > overwritten extents when they are completely overwritten (and only has > > to check for the existence of one reference, not enumerate them all). > > Right, I know you can't overwrite the data in place. What I'm not > understanding is why you can't just just write the new data elsewhere > and then free the no longer used portion of the old extent. > > > When a file refers to an extent, it refers to the entire extent from the > > file's subvol tree, even if only a single byte of the extent is contained > > in the file. There's no mechanism in btrfs extent tree v1 for atomically > > replacing an extent with separately referenceable objects, and updating > > all the pointers to parts of the old object to point to the new one. > > Any such update could cascade into updates across all reflinks and > > snapshots of the extent, so the write multiplier can be arbitrarily large. > > So the inode in the subvol tree points to an extent in the extent tree, > and then the extent points to the space on disk? The extent item tracks ownership of the space on disk. The extent item key _is_ the location on disk, so there's no need for a pointer in the item itself (e.g. read doesn't bother with the extent tree, it just goes straight from the inode ref to the data blocks and csums). The extent tree only comes up to resolve ownership issues, like whether the last reference to an extent has been removed, or a new reference added, or whether multiple references to the extent exist. > And only one extent in > the extent tree can ever point to a given location on disk? Correct. That restriction is characteristic of extent tree v1. Each extent maintains a list of references to itself. The extent is the exclusive owner of the physical space, and ownership of the extent item is shared by multiple inode references. Each inode reference knows which bytes of the extent it is referring to, but this information is scattered over the subvol trees and not available in the extent tree. Extent tree v2 creates a separate extent object in the extent tree for each reflink, and allows the physical regions covered by each extent to overlap. The inode reference is the exclusive owner of the extent item, and ownership of the physical space is shared by multiple extents. 
The extent tree in v2 tracks which inodes refer to which specific blocks, so the availability of a block can be computed without referring to any other trees. In v2, free space is recalculated when an extent is removed. The nearby extent tree is searched to see if any blocks no longer overlap with an extent, and any such blocks are added to free space. To me it looks like that free space search is O(N), since there's no proposed data structure to make it not a linear search of every possibly-overlapping extent item (all extents within MAX_EXTENT_SIZE bytes from the point where space was freed). The v2 proposal also has a deferred GC worker, so maybe the O(N) searches will be performed in a background thread where they aren't as time-sensitive, and maybe the search cost can be amortized over multiple deletions near the same physical position. Deferred GC doesn't help nodatacow or prealloc though, which have to know whether a block is shared during the write operation, and can't wait until later. > In other words, if file B is a reflink copy of file A, and you update > one page in file B, it can't just create 3 new extents in the extent > tree: one that refers to the firt part of the original extent, one that > refers to the last part of the original extent, and one for the new > location of the new data? Instead file B refers to the original extent, > and to one new extent, in such a way that the second superceeds part of > the first only for file B? Correct. Changing an extent in tree v1 requires updating every reference to the extent, because any inode referring to the entire extent will now need to refer to 3 distinct extent items. That means updating metadata pages in snapshots, and can lead to 4-digit multiples of write amplification with only a few dozen snapshots--in the worst cases there are page splits because the old data now needs space for 3x more reference items. So in v1 we don't do anything like that--extents are immutable from the moment they are created until their last reference is deleted. In v2, file B doesn't refer to file A's extent. Instead, file B creates a new extent which overlaps the physical space of file A's extent. After overwriting the one new page, file B then replaces its reference to file A's space with two new references to shared parts of file A's space, and a third new extent item for the new data in B. If file A is later deleted, the lack of reference to the middle of the physical space is (eventually) detected, and the overwritten part of the shared extent becomes free space. ^ permalink raw reply [flat|nested] 71+ messages in thread
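The v1 behaviour described above can be observed from userspace with ordinary tools; a rough sketch (hypothetical file names and sizes, not taken from the thread):

dd if=/dev/urandom of=A bs=1M count=8
sync
cp --reflink=always A B
dd if=/dev/zero of=B bs=4k count=1 seek=1024 conv=notrunc   # overwrite one 4K block of B
sync
compsize A B
# Expect roughly: Referenced ~16M across both files, but Disk Usage only ~8M plus
# one new 4K extent -- B still refers to most of A's original extent, and that
# extent stays whole on disk even though B no longer uses one block of it.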
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-14 22:59 ` Zygo Blaxell 2022-03-15 18:28 ` Phillip Susi @ 2022-03-20 17:50 ` Forza 2022-03-20 21:15 ` Zygo Blaxell 1 sibling, 1 reply; 71+ messages in thread From: Forza @ 2022-03-20 17:50 UTC (permalink / raw) To: Zygo Blaxell, Phillip Susi; +Cc: Qu Wenruo, Jan Ziak, linux-btrfs On 2022-03-14 23:59, Zygo Blaxell wrote: > > On Mon, Mar 14, 2022 at 04:09:08PM -0400, Phillip Susi wrote: >> >> Qu Wenruo <quwenruo.btrfs@gmx.com> writes: >> >>> That's more or less expected. >>> >>> Autodefrag has two limitations: >>> >>> 1. Only defrag newer writes >>> It doesn't defrag older fragments. >>> This is the existing behavior from the beginning of autodefrag. >>> Thus it's not that effective against small random writes. >> >> I don't understand this bit. The whole point of defrag is to reduce the >> fragmentation of previous writes. New writes should always attempt to >> follow the previous one if possible. > This is my assumption as well. I believe that VM images was one of the original use cases (though I have no reference to that at the moment). The btrfs wiki[1] says the following for autodefrag: "Though autodefrag affects newly written data, it can read a few adjacent blocks (up to 64k) and write the contiguous extent to a new location. The adjacent blocks will be unshared." The Btrfs administration manual[2] says the following: "When enabled, small random writes into files (in a range of tens of kilobytes, currently it’s 64KiB) are detected and queued up for the defragmentation process. Not well suited for large database workloads." > New writes are allocated to the first available free space hole large > enough to hold them, starting from the point of the last write (plus > some other details like clustering and alignment). The goal is that > data writes from memory are sequential as much as possible, even if > many different files were written in the same transaction. > > btrfs extents are immutable, so the filesystem can't extend an existing > extent with new data. Instead, a new extent must be created that contains > both the old and new data to replace the old extent. At least one new > fragment must be created whenever the filesystem is modified. (In > zoned mode, this is strictly enforced by the underlying hardware.) > >> If auto defrag only changes the >> behavior of new writes, then how does it change it and why is that not >> the way new writes are always done? > > Autodefrag doesn't change write behavior directly. It is a > post-processing thread that rereads and rewrites recently written data, > _after_ it was originally written to disk. > > In theory, running defrag after the writes means that the writes can > be fast for low latency--they are a physically sequential stream of > blocks sent to the disk as fast as it can write them, because btrfs does > not have to be concerned with trying to achieve physical contiguity > of logically discontiguous data. Later on, when latency is no longer an > issue and some IO bandwidth is available, the fragments can be reread > and collected together into larger logically and physically contiguous > extents by a background process. > > In practice, autodefrag does only part of that task, badly. > > Say we have a program that writes 4K to the end of a file, every 5 > seconds, for 5 minutes. 
> > Every 30 seconds (default commit interval), kernel writeback submits all > the dirty pages for writing to btrfs, and in 30 seconds there will be 6 > x 4K = 24K of those. An extent in btrfs is created to hold the pages, > filled with the data blocks, connected to the various filesystem trees, > and flushed out to disk. > > Over 5 minutes this will happen 10 times, so the file contains 10 > fragments, each about 24K (commits are asynchronous, so it might be > 20K in one fragment and 28K in the next). > > After each commit, inodes with new extents are appended to a list > in memory. Each list entry contains an inode, a transid of the commit > where the first write occurred, and the last defrag offset. That list > is processed by a kernel thread some time after the commits are written > to disk. The thread searches the inodes for extents created after the > last defrag transid, invokes defrag_range on each of these, and advances > the offset. If the search offset reaches the end of file, then it is > reset to the beginning and another loop is done, and if the next search > loop over the file doesn't find new extents then the inode is removed > from the defrag list. > > If there's a 5 minute delay between the original writes and autodefrag > finally catching up, then autodefrag will detect 10 new extents and > run defrag_range over them. This is a read-then-write operation, since > the extent blocks may no longer be present in memory after writeback, > so autodefrag can easily fall behind writes if there are a lot of them. > Also the 64K size limit kicks in, so it might write 5 extents (2 x 24K = > 48K, but 3 x 24K = 72K, and autodefrag cuts off at 64K). > > If there's a 1 minute delay between the original writes and autodefrag, > then autodefrag will detect 1 new extents and run defrag over them > for a total of 5 new extents, about 240K each. If there's no delay > at all, then there will be 10 extents of 120K each--if autodefrag > runs immediately after commit, it will see only one extent in each > loop, and issue no defrag_range calls. > > Seen from the point of view of the disk, there are always at least > 10x 120K writes. In the no-autodefrag case it ends there. In the > autodefrag cases, some of the data is read and rewritten later to make > larger extents. > > In non-appending cases, the kernel autodefrag doesn't do very much useful > at all--random writes aren't logically contiguous, so autodefrag never > sees two adjacent extents in a search result, and therefore never sees > an opportunity to defrag anything. I have a worst-case scenario with Netdata. It stores historical data in ndf files that are up to 1GiB in size. In addition there is a journal file of about 100-200MiB. The extents are extremely small and sequential read speeds are around 1-2MiB/s (this is a HDD), which makes fetching historical data _extremely_ slow. Kernel 5.16.12, 5.16.16 btrfs-progs v5.16.2 Files: Size Date Name 1073741824 Mar 15 03:02 datafile-1-0000000113.ndf 1024217088 Mar 20 18:10 datafile-1-0000000114.ndf 140648448 Mar 15 03:02 journalfile-1-0000000113.njf 137732096 Mar 20 18:05 journalfile-1-0000000114.njf # compsize datafile-1-0000000113.ndf ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Processed 1 file, 59407 regular extents (59407 refs), 0 inline. 
Type Perc Disk Usage Uncompressed Referenced TOTAL 100% 1.0G 1.0G 1.0G none 100% 1.0G 1.0G 1.0G ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The average size of the extents is here 17KiB # compsize journalfile-1-0000000113.njf ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Processed 1 file, 34338 regular extents (34338 refs), 0 inline. Type Perc Disk Usage Uncompressed Referenced TOTAL 100% 134M 134M 134M none 100% 134M 134M 134M ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The average extent of the journal file is 4KiB. I have "mount -o autodefrag" enabled but it has no effect on these files. I have also tried enabling compression with "btrfs propterty set compression zstd" but it did not reduce the file size or change the amount of extents much. As a last resort I tried running Netdata behind "eatmydata", but it also didn't help. It seems that this case is exactly as Zygo describes, that small amounts of random writes do not get considered for defragment. It takes about 5 days to fill one of these ndf datafiles (about 8-9MiB per hour). # filefrag -v datafile-1-0000000114.ndf ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Filesystem type is: 9123683e File size of datafile-1-0000000114.ndf is 1028616192 (251127 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 0: 15863417202..15863417202: 1: 1: 1.. 1: 15863417597..15863417597: 1: 15863417203: 2: 2.. 2: 15874142482..15874142482: 1: 15863417598: 3: 3.. 8: 16093579003..16093579008: 6: 15874142483: 4: 9.. 13: 16017881714..16017881718: 5: 16093579009: 5: 14.. 19: 16095939276..16095939281: 6: 16017881719: 6: 20.. 27: 16110397810..16110397817: 8: 16095939282: 7: 28.. 29: 15874165302..15874165303: 2: 16110397818: 8: 30.. 30: 15874160314..15874160314: 1: 15874165304: 9: 31.. 31: 15874164808..15874164808: 1: 15874160315: 10: 32.. 39: 16110399763..16110399770: 8: 15874164809: 11: 40.. 43: 16017882226..16017882229: 4: 16110399771: 12: 44.. 47: 16017882292..16017882295: 4: 16017882230: 13: 48.. 53: 16097265263..16097265268: 6: 16017882296: 14: 54.. 55: 15877195212..15877195213: 2: 16097265269: 15: 56.. 60: 16018077866..16018077870: 5: 15877195214: 16: 61.. 64: 16017882755..16017882758: 4: 16018077871: 17: 65.. 68: 16017882623..16017882626: 4: 16017882759: 18: 69.. 69: 15877196587..15877196587: 1: 16017882627: 19: 70.. 70: 15877198419..15877198419: 1: 15877196588: 20: 71.. 82: 16110463493..16110463504: 12: 15877198420: 21: 83.. 83: 15878073533..15878073533: 1: 16110463505: 22: 84.. 84: 15878073875..15878073875: 1: 15878073534: 23: 85.. 85: 15878074124..15878074124: 1: 15878073876: 24: 86.. 86: 15878074958..15878074958: 1: 15878074125: 25: 87.. 87: 15878268816..15878268816: 1: 15878074959: 26: 88.. 88: 15878297633..15878297633: 1: 15878268817: 27: 89.. 89: 15878144045..15878144045: 1: 15878297634: 28: 90.. 90: 15878144854..15878144854: 1: 15878144046: 29: 91.. 91: 15880621654..15880621654: 1: 15878144855: 30: 92.. 92: 15884311220..15884311220: 1: 15880621655: 31: 93.. 93: 15884314722..15884314722: 1: 15884311221: 32: 94.. 94: 15884314726..15884314726: 1: 15884314723: 33: 95.. 95: 15877198895..15877198895: 1: 15884314727: 34: 96.. 96: 15877199305..15877199305: 1: 15877198896: 35: 97.. 98: 15877199312..15877199313: 2: 15877199306: 36: 99.. 101: 15878346833..15878346835: 3: 15877199314: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > At the time autodefrag was added to the kernel (May 2011), it was already > possible do to a better job in userspace for over a year (Feb 2010). 
> Between 2012 and 2021 there are only a handful of bug fixes, mostly of > the form "stop autodefrag from ruining things for the rest of the kernel." Doesn't userspace defragmentation have the disadvantage that it has to process the whole file with all its extents? But if it could be used to defrag only the last few modified extents, could it help in situations like this? Certainly a userspace defragment daemon could be used to implement custom policies suitable for specific workloads. Thanks Forza [1] https://btrfs.wiki.kernel.org/index.php/Status [2] https://btrfs.readthedocs.io/en/latest/Administration.html?highlight=autodefrag#mount-options ^ permalink raw reply [flat|nested] 71+ messages in thread
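One deliberately simple form such a userspace policy could take, sketched with plain btrfs-progs (the file name comes from the example above; the 16MiB window and 32M target size are arbitrary, and a ranged defrag also unshares any reflinked or snapshotted data it touches):

f=datafile-1-0000000114.ndf
size=$(stat -c %s "$f")
win=$((16 * 1024 * 1024))                    # only touch the most recently written tail
start=$(( size > win ? size - win : 0 ))
btrfs filesystem defragment -f -s "$start" -l "$win" -t 32M "$f"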
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-20 17:50 ` Forza @ 2022-03-20 21:15 ` Zygo Blaxell 0 siblings, 0 replies; 71+ messages in thread From: Zygo Blaxell @ 2022-03-20 21:15 UTC (permalink / raw) To: Forza; +Cc: Phillip Susi, Qu Wenruo, Jan Ziak, linux-btrfs On Sun, Mar 20, 2022 at 06:50:55PM +0100, Forza wrote: > > > > On 2022-03-14 23:59, Zygo Blaxell wrote: > > > > On Mon, Mar 14, 2022 at 04:09:08PM -0400, Phillip Susi wrote: > > > > > > Qu Wenruo <quwenruo.btrfs@gmx.com> writes: > > > > > > > That's more or less expected. > > > > > > > > Autodefrag has two limitations: > > > > > > > > 1. Only defrag newer writes > > > > It doesn't defrag older fragments. > > > > This is the existing behavior from the beginning of autodefrag. > > > > Thus it's not that effective against small random writes. > > > > > > I don't understand this bit. The whole point of defrag is to reduce the > > > fragmentation of previous writes. New writes should always attempt to > > > follow the previous one if possible. > > > > This is my assumption as well. I believe that VM images was one of the > original use cases (though I have no reference to that at the moment). > > The btrfs wiki[1] says the following for autodefrag: > > "Though autodefrag affects newly written data, it can read a few adjacent > blocks (up to 64k) and write the contiguous extent to a new location. The > adjacent blocks will be unshared." > > The Btrfs administration manual[2] says the following: > > "When enabled, small random writes into files (in a range of tens of > kilobytes, currently it’s 64KiB) are detected and queued up for the > defragmentation process. Not well suited for large database workloads." These statements are not technically incorrect, but they are an understatement of the situation. What's missing is that autodefrag's very specific behavior isn't useful for typical user workloads prone to fragmentation. > > New writes are allocated to the first available free space hole large > > enough to hold them, starting from the point of the last write (plus > > some other details like clustering and alignment). The goal is that > > data writes from memory are sequential as much as possible, even if > > many different files were written in the same transaction. > > > > btrfs extents are immutable, so the filesystem can't extend an existing > > extent with new data. Instead, a new extent must be created that contains > > both the old and new data to replace the old extent. At least one new > > fragment must be created whenever the filesystem is modified. (In > > zoned mode, this is strictly enforced by the underlying hardware.) > > > > > If auto defrag only changes the > > > behavior of new writes, then how does it change it and why is that not > > > the way new writes are always done? > > > > Autodefrag doesn't change write behavior directly. It is a > > post-processing thread that rereads and rewrites recently written data, > > _after_ it was originally written to disk. > > > > In theory, running defrag after the writes means that the writes can > > be fast for low latency--they are a physically sequential stream of > > blocks sent to the disk as fast as it can write them, because btrfs does > > not have to be concerned with trying to achieve physical contiguity > > of logically discontiguous data. 
Later on, when latency is no longer an > > issue and some IO bandwidth is available, the fragments can be reread > > and collected together into larger logically and physically contiguous > > extents by a background process. > > > > In practice, autodefrag does only part of that task, badly. > > > > Say we have a program that writes 4K to the end of a file, every 5 > > seconds, for 5 minutes. > > > > Every 30 seconds (default commit interval), kernel writeback submits all > > the dirty pages for writing to btrfs, and in 30 seconds there will be 6 > > x 4K = 24K of those. An extent in btrfs is created to hold the pages, > > filled with the data blocks, connected to the various filesystem trees, > > and flushed out to disk. > > > > Over 5 minutes this will happen 10 times, so the file contains 10 > > fragments, each about 24K (commits are asynchronous, so it might be > > 20K in one fragment and 28K in the next). > > > > After each commit, inodes with new extents are appended to a list > > in memory. Each list entry contains an inode, a transid of the commit > > where the first write occurred, and the last defrag offset. That list > > is processed by a kernel thread some time after the commits are written > > to disk. The thread searches the inodes for extents created after the > > last defrag transid, invokes defrag_range on each of these, and advances > > the offset. If the search offset reaches the end of file, then it is > > reset to the beginning and another loop is done, and if the next search > > loop over the file doesn't find new extents then the inode is removed > > from the defrag list. > > > > If there's a 5 minute delay between the original writes and autodefrag > > finally catching up, then autodefrag will detect 10 new extents and > > run defrag_range over them. This is a read-then-write operation, since > > the extent blocks may no longer be present in memory after writeback, > > so autodefrag can easily fall behind writes if there are a lot of them. > > Also the 64K size limit kicks in, so it might write 5 extents (2 x 24K = > > 48K, but 3 x 24K = 72K, and autodefrag cuts off at 64K). > > > > If there's a 1 minute delay between the original writes and autodefrag, > > then autodefrag will detect 1 new extents and run defrag over them > > for a total of 5 new extents, about 240K each. If there's no delay > > at all, then there will be 10 extents of 120K each--if autodefrag > > runs immediately after commit, it will see only one extent in each > > loop, and issue no defrag_range calls. > > > > Seen from the point of view of the disk, there are always at least > > 10x 120K writes. In the no-autodefrag case it ends there. In the > > autodefrag cases, some of the data is read and rewritten later to make > > larger extents. > > > > In non-appending cases, the kernel autodefrag doesn't do very much useful > > at all--random writes aren't logically contiguous, so autodefrag never > > sees two adjacent extents in a search result, and therefore never sees > > an opportunity to defrag anything. > > I have a worst-case scenario with Netdata. It stores historical data in ndf > files that are up to 1GiB in size. In addition there is a journal file of > about 100-200MiB. The extents are extremely small and sequential read speeds > are around 1-2MiB/s (this is a HDD), which makes fetching historical data > _extremely_ slow. 
> > Kernel 5.16.12, 5.16.16 > btrfs-progs v5.16.2 > > Files: > Size Date Name > 1073741824 Mar 15 03:02 datafile-1-0000000113.ndf > 1024217088 Mar 20 18:10 datafile-1-0000000114.ndf > 140648448 Mar 15 03:02 journalfile-1-0000000113.njf > 137732096 Mar 20 18:05 journalfile-1-0000000114.njf > > > # compsize datafile-1-0000000113.ndf > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > Processed 1 file, 59407 regular extents (59407 refs), 0 inline. > Type Perc Disk Usage Uncompressed Referenced > TOTAL 100% 1.0G 1.0G 1.0G > none 100% 1.0G 1.0G 1.0G > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > The average size of the extents is here 17KiB > > > # compsize journalfile-1-0000000113.njf > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > Processed 1 file, 34338 regular extents (34338 refs), 0 inline. > Type Perc Disk Usage Uncompressed Referenced > TOTAL 100% 134M 134M 134M > none 100% 134M 134M 134M > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > The average extent of the journal file is 4KiB. > > I have "mount -o autodefrag" enabled but it has no effect on these files. I > have also tried enabling compression with "btrfs propterty set compression > zstd" but it did not reduce the file size or change the amount of extents > much. > > As a last resort I tried running Netdata behind "eatmydata", but it also > didn't help. > > It seems that this case is exactly as Zygo describes, that small amounts of > random writes do not get considered for defragment. It takes about 5 days to > fill one of these ndf datafiles (about 8-9MiB per hour). > > # filefrag -v datafile-1-0000000114.ndf > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > Filesystem type is: 9123683e > File size of datafile-1-0000000114.ndf is 1028616192 (251127 blocks of 4096 > bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 0: 15863417202..15863417202: 1: > 1: 1.. 1: 15863417597..15863417597: 1: 15863417203: > 2: 2.. 2: 15874142482..15874142482: 1: 15863417598: > 3: 3.. 8: 16093579003..16093579008: 6: 15874142483: > 4: 9.. 13: 16017881714..16017881718: 5: 16093579009: > 5: 14.. 19: 16095939276..16095939281: 6: 16017881719: > 6: 20.. 27: 16110397810..16110397817: 8: 16095939282: > 7: 28.. 29: 15874165302..15874165303: 2: 16110397818: > 8: 30.. 30: 15874160314..15874160314: 1: 15874165304: > 9: 31.. 31: 15874164808..15874164808: 1: 15874160315: > 10: 32.. 39: 16110399763..16110399770: 8: 15874164809: > 11: 40.. 43: 16017882226..16017882229: 4: 16110399771: > 12: 44.. 47: 16017882292..16017882295: 4: 16017882230: > 13: 48.. 53: 16097265263..16097265268: 6: 16017882296: > 14: 54.. 55: 15877195212..15877195213: 2: 16097265269: > 15: 56.. 60: 16018077866..16018077870: 5: 15877195214: > 16: 61.. 64: 16017882755..16017882758: 4: 16018077871: > 17: 65.. 68: 16017882623..16017882626: 4: 16017882759: > 18: 69.. 69: 15877196587..15877196587: 1: 16017882627: > 19: 70.. 70: 15877198419..15877198419: 1: 15877196588: > 20: 71.. 82: 16110463493..16110463504: 12: 15877198420: > 21: 83.. 83: 15878073533..15878073533: 1: 16110463505: > 22: 84.. 84: 15878073875..15878073875: 1: 15878073534: > 23: 85.. 85: 15878074124..15878074124: 1: 15878073876: > 24: 86.. 86: 15878074958..15878074958: 1: 15878074125: > 25: 87.. 87: 15878268816..15878268816: 1: 15878074959: > 26: 88.. 88: 15878297633..15878297633: 1: 15878268817: > 27: 89.. 89: 15878144045..15878144045: 1: 15878297634: > 28: 90.. 90: 15878144854..15878144854: 1: 15878144046: > 29: 91.. 91: 15880621654..15880621654: 1: 15878144855: > 30: 92.. 92: 15884311220..15884311220: 1: 15880621655: > 31: 93.. 
93: 15884314722..15884314722: 1: 15884311221: > 32: 94.. 94: 15884314726..15884314726: 1: 15884314723: > 33: 95.. 95: 15877198895..15877198895: 1: 15884314727: > 34: 96.. 96: 15877199305..15877199305: 1: 15877198896: > 35: 97.. 98: 15877199312..15877199313: 2: 15877199306: > 36: 99.. 101: 15878346833..15878346835: 3: 15877199314: > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > At the time autodefrag was added to the kernel (May 2011), it was already > > possible do to a better job in userspace for over a year (Feb 2010). > > Between 2012 and 2021 there are only a handful of bug fixes, mostly of > > the form "stop autodefrag from ruining things for the rest of the kernel." > > Doesn't userspace defragment has the disadvantage that is has to process the > whole file with all it's extents? No. There's a DEFRAG_RANGE ioctl which will defragment a specific range of a specific file, with some restrictions. The restrictions can be worked around by making a temporary copy and deduping it over the original data. It's possible to rearrange extents more or less arbitrarily from userspace, mostly invisible to applications that might be reading or writing them at the same time. I forget exactly what the restrictions are. We would need defrag to skip the write permission check for superuser, and skip atime/mtime/ctime updates, the same way the dedupe ioctl does. We'd also need to remove any minimum size limits or second-guessing in DEFRAG_RANGE so that it does precisely what userspace tells it to do. Currently there's some hardcoded assumptions built into DEFRAG_RANGE because 'btrfs fi defrag' isn't smart enough to issue good DEFRAG_RANGE commands. Unfortunately both defrag and dedupe have a problem where they will only operate on one extent reference at a time, so a solution that supports snapshots will involve a mix of both ioctls (defrag to construct a defragmented extent, dedupe to install that extent in each affected snapshot, plus some logic to decide whether it's worthwhile to do that or leave the old extents alone). That can be done _after_ getting basic autodefrag working for the easier single-subvol cases. > But if it could be used to defrag only the > last few modified extents could help in situations like this? "Last few modified extents" is the broken thing the kernel does. Autodefrag should start with those (they can be found very quickly using tree_search), but the defrag region must expand to include some of the older adjacent extents too. > Certainly a > userspace defragment daemon could be used to implement custom policies > suitable for specific workloads. > > Thanks > Forza > > > [1] https://btrfs.wiki.kernel.org/index.php/Status > [2] https://btrfs.readthedocs.io/en/latest/Administration.html?highlight=autodefrag#mount-options > ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-07 2:39 ` Qu Wenruo 2022-03-07 7:31 ` Qu Wenruo @ 2022-03-08 21:57 ` Jan Ziak 2022-03-08 23:40 ` Qu Wenruo 2022-03-09 4:48 ` Zygo Blaxell 1 sibling, 2 replies; 71+ messages in thread From: Jan Ziak @ 2022-03-08 21:57 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On Mon, Mar 7, 2022 at 3:39 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > On 2022/3/7 10:23, Jan Ziak wrote: > > BTW: "compsize file-with-million-extents" finishes in 0.2 seconds > > (uses BTRFS_IOC_TREE_SEARCH_V2 ioctl), but "filefrag > > file-with-million-extents" doesn't finish even after several minutes > > of time (uses FS_IOC_FIEMAP ioctl - manages to perform only about 5 > > ioctl syscalls per second - and appears to be slowing down as the > > value of the "fm_start" ioctl argument grows; e2fsprogs version > > 1.46.5). It would be nice if filefrag was faster than just a few > > ioctls per second. > > This is mostly a race with autodefrag. > > Both are using file extent map, thus if autodefrag is still trying to > redirty the file again and again, it would definitely cause problems for > anything also using file extent map. It isn't caused by a race with autodefrag, but by something else. Autodefrag was turned off when I was running "filefrag file-with-million-extents". $ /usr/bin/time filefrag file-with-million-extents.sqlite Ctrl+C Command terminated by signal 2 0.000000 user, 4.327145 system, 0:04.331167 elapsed, 99% CPU Sincerely Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-08 21:57 ` Jan Ziak @ 2022-03-08 23:40 ` Qu Wenruo 2022-03-09 22:22 ` Jan Ziak 0 siblings, 1 reply; 71+ messages in thread From: Qu Wenruo @ 2022-03-08 23:40 UTC (permalink / raw) To: Jan Ziak; +Cc: linux-btrfs On 2022/3/9 05:57, Jan Ziak wrote: > On Mon, Mar 7, 2022 at 3:39 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >> On 2022/3/7 10:23, Jan Ziak wrote: >>> BTW: "compsize file-with-million-extents" finishes in 0.2 seconds >>> (uses BTRFS_IOC_TREE_SEARCH_V2 ioctl), but "filefrag >>> file-with-million-extents" doesn't finish even after several minutes >>> of time (uses FS_IOC_FIEMAP ioctl - manages to perform only about 5 >>> ioctl syscalls per second - and appears to be slowing down as the >>> value of the "fm_start" ioctl argument grows; e2fsprogs version >>> 1.46.5). It would be nice if filefrag was faster than just a few >>> ioctls per second. >> >> This is mostly a race with autodefrag. >> >> Both are using file extent map, thus if autodefrag is still trying to >> redirty the file again and again, it would definitely cause problems for >> anything also using file extent map. > > It isn't caused by a race with autodefrag, but by something else. > Autodefrag was turned off when I was running "filefrag > file-with-million-extents". > > $ /usr/bin/time filefrag file-with-million-extents.sqlite > Ctrl+C Command terminated by signal 2 > 0.000000 user, 4.327145 system, 0:04.331167 elapsed, 99% CPU Too many file extents will slow down the full fiemap call. If you use ranged fiemap, like: xfs_io -c "fiemap -v 0 4k" <file> It should finish very quick. BTW, I have sent out a new autodefrag patch and CCed you. Mind testing that patch? (Better with trace events enabled) Thanks, Qu > > Sincerely > Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
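If a map of the whole file is still wanted, the ranged form can be applied in windows instead of one huge call; a sketch (the file name is from the thread, the 1GiB window size is arbitrary, and each window still has to resolve extent sharing internally, so the total cost remains significant):

f=file-with-million-extents.sqlite
size=$(stat -c %s "$f")
off=0
while [ "$off" -lt "$size" ]; do
    xfs_io -rc "fiemap -v $off 1g" "$f"    # one bounded FIEMAP window at a time
    off=$((off + 1024 * 1024 * 1024))
done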
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-08 23:40 ` Qu Wenruo @ 2022-03-09 22:22 ` Jan Ziak 2022-03-09 22:44 ` Qu Wenruo 0 siblings, 1 reply; 71+ messages in thread From: Jan Ziak @ 2022-03-09 22:22 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On Wed, Mar 9, 2022 at 12:40 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > On 2022/3/9 05:57, Jan Ziak wrote: > > On Mon, Mar 7, 2022 at 3:39 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > >> On 2022/3/7 10:23, Jan Ziak wrote: > >>> BTW: "compsize file-with-million-extents" finishes in 0.2 seconds > >>> (uses BTRFS_IOC_TREE_SEARCH_V2 ioctl), but "filefrag > >>> file-with-million-extents" doesn't finish even after several minutes > >>> of time (uses FS_IOC_FIEMAP ioctl - manages to perform only about 5 > >>> ioctl syscalls per second - and appears to be slowing down as the > >>> value of the "fm_start" ioctl argument grows; e2fsprogs version > >>> 1.46.5). It would be nice if filefrag was faster than just a few > >>> ioctls per second. > >> > >> This is mostly a race with autodefrag. > >> > >> Both are using file extent map, thus if autodefrag is still trying to > >> redirty the file again and again, it would definitely cause problems for > >> anything also using file extent map. > > > > It isn't caused by a race with autodefrag, but by something else. > > Autodefrag was turned off when I was running "filefrag > > file-with-million-extents". > > > > $ /usr/bin/time filefrag file-with-million-extents.sqlite > > Ctrl+C Command terminated by signal 2 > > 0.000000 user, 4.327145 system, 0:04.331167 elapsed, 99% CPU > > Too many file extents will slow down the full fiemap call. > > If you use ranged fiemap, like: > > xfs_io -c "fiemap -v 0 4k" <file> > > It should finish very quick. Unfortunately, that doesn't seem to be the case (Linux 5.16.12). xfs_io -c "fiemap -v 0 4g" completes and prints .... 16935: [8387168..8388791]: 22237781600..22237783223 1624 0x0 in 0.6 seconds. But xfs_io -c "fiemap -v 0 40g" is significantly slower, does not complete in a reasonable time, and makes it to 1000 .... 1000: [154576..154903]: 22232564688..22232565015 328 0x0 .... in 6.5 seconds. The NVMe device was mostly idle when running the above commands (reads and writes per second were close to zero). In summary: xfs_io -c "fiemap -v 0 4g" is approximately 185 times faster than xfs_io -c "fiemap -v 0 40g". -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-09 22:22 ` Jan Ziak @ 2022-03-09 22:44 ` Qu Wenruo 2022-03-09 22:55 ` Jan Ziak 0 siblings, 1 reply; 71+ messages in thread From: Qu Wenruo @ 2022-03-09 22:44 UTC (permalink / raw) To: Jan Ziak; +Cc: linux-btrfs On 2022/3/10 06:22, Jan Ziak wrote: > On Wed, Mar 9, 2022 at 12:40 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >> On 2022/3/9 05:57, Jan Ziak wrote: >>> On Mon, Mar 7, 2022 at 3:39 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >>>> On 2022/3/7 10:23, Jan Ziak wrote: >>>>> BTW: "compsize file-with-million-extents" finishes in 0.2 seconds >>>>> (uses BTRFS_IOC_TREE_SEARCH_V2 ioctl), but "filefrag >>>>> file-with-million-extents" doesn't finish even after several minutes >>>>> of time (uses FS_IOC_FIEMAP ioctl - manages to perform only about 5 >>>>> ioctl syscalls per second - and appears to be slowing down as the >>>>> value of the "fm_start" ioctl argument grows; e2fsprogs version >>>>> 1.46.5). It would be nice if filefrag was faster than just a few >>>>> ioctls per second. >>>> >>>> This is mostly a race with autodefrag. >>>> >>>> Both are using file extent map, thus if autodefrag is still trying to >>>> redirty the file again and again, it would definitely cause problems for >>>> anything also using file extent map. >>> >>> It isn't caused by a race with autodefrag, but by something else. >>> Autodefrag was turned off when I was running "filefrag >>> file-with-million-extents". >>> >>> $ /usr/bin/time filefrag file-with-million-extents.sqlite >>> Ctrl+C Command terminated by signal 2 >>> 0.000000 user, 4.327145 system, 0:04.331167 elapsed, 99% CPU >> >> Too many file extents will slow down the full fiemap call. >> >> If you use ranged fiemap, like: >> >> xfs_io -c "fiemap -v 0 4k" <file> >> >> It should finish very quick. > > Unfortunately, that doesn't seem to be the case (Linux 5.16.12). > > xfs_io -c "fiemap -v 0 4g" completes and prints Well, I literally mean 4k, which is ensured to be one extent. Thanks, Qu > > .... > 16935: [8387168..8388791]: 22237781600..22237783223 1624 0x0 > > in 0.6 seconds. > > But xfs_io -c "fiemap -v 0 40g" is significantly slower, does not > complete in a reasonable time, and makes it to 1000 > > .... > 1000: [154576..154903]: 22232564688..22232565015 328 0x0 > .... > > in 6.5 seconds. > > The NVMe device was mostly idle when running the above commands (reads > and writes per second were close to zero). > > In summary: xfs_io -c "fiemap -v 0 4g" is approximately 185 times > faster than xfs_io -c "fiemap -v 0 40g". > > -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-09 22:44 ` Qu Wenruo @ 2022-03-09 22:55 ` Jan Ziak 2022-03-09 23:00 ` Jan Ziak 0 siblings, 1 reply; 71+ messages in thread From: Jan Ziak @ 2022-03-09 22:55 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On Wed, Mar 9, 2022 at 11:44 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > xfs_io -c "fiemap -v 0 40g" > > Well, I literally mean 4k, which is ensured to be one extent. The usefulness of such information would be 4k/40g = 1e-6 = 0.0001%. -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-09 22:55 ` Jan Ziak @ 2022-03-09 23:00 ` Jan Ziak 0 siblings, 0 replies; 71+ messages in thread From: Jan Ziak @ 2022-03-09 23:00 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On Wed, Mar 9, 2022 at 11:55 PM Jan Ziak <0xe2.0x9a.0x9b@gmail.com> wrote: > > On Wed, Mar 9, 2022 at 11:44 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > > xfs_io -c "fiemap -v 0 40g" > > > > Well, I literally mean 4k, which is ensured to be one extent. > > The usefulness of such information would be 4k/40g = 1e-6 = 0.0001%. 1e-7 or 0.00001%, of course. Sorry about the confusion. -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-08 21:57 ` Jan Ziak 2022-03-08 23:40 ` Qu Wenruo @ 2022-03-09 4:48 ` Zygo Blaxell 1 sibling, 0 replies; 71+ messages in thread From: Zygo Blaxell @ 2022-03-09 4:48 UTC (permalink / raw) To: Jan Ziak; +Cc: Qu Wenruo, linux-btrfs On Tue, Mar 08, 2022 at 10:57:51PM +0100, Jan Ziak wrote: > On Mon, Mar 7, 2022 at 3:39 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > On 2022/3/7 10:23, Jan Ziak wrote: > > > BTW: "compsize file-with-million-extents" finishes in 0.2 seconds > > > (uses BTRFS_IOC_TREE_SEARCH_V2 ioctl), but "filefrag > > > file-with-million-extents" doesn't finish even after several minutes > > > of time (uses FS_IOC_FIEMAP ioctl - manages to perform only about 5 > > > ioctl syscalls per second - and appears to be slowing down as the > > > value of the "fm_start" ioctl argument grows; e2fsprogs version > > > 1.46.5). It would be nice if filefrag was faster than just a few > > > ioctls per second. > > > > This is mostly a race with autodefrag. > > > > Both are using file extent map, thus if autodefrag is still trying to > > redirty the file again and again, it would definitely cause problems for > > anything also using file extent map. > > It isn't caused by a race with autodefrag, but by something else. > Autodefrag was turned off when I was running "filefrag > file-with-million-extents". > > $ /usr/bin/time filefrag file-with-million-extents.sqlite > Ctrl+C Command terminated by signal 2 > 0.000000 user, 4.327145 system, 0:04.331167 elapsed, 99% CPU FIEMAP will try to populate the SHARED bit for each extent, which requires checking every extent that overlaps a block range to see if the block is present. It can be very expensive for large, random-written files. No way to fix that without disabling the SHARED bit in FIEMAP. > Sincerely > Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-06 15:59 Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit Jan Ziak 2022-03-07 0:48 ` Qu Wenruo @ 2022-03-07 14:30 ` Phillip Susi 2022-03-08 21:43 ` Jan Ziak 2022-03-16 12:47 ` Kai Krakow 2 siblings, 1 reply; 71+ messages in thread From: Phillip Susi @ 2022-03-07 14:30 UTC (permalink / raw) To: Jan Ziak; +Cc: linux-btrfs Jan Ziak <0xe2.0x9a.0x9b@gmail.com> writes: > Manual defragmentation decreased the file's size by 7 GB: Eh? How does defragging change a file's size? ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-07 14:30 ` Phillip Susi @ 2022-03-08 21:43 ` Jan Ziak 2022-03-09 18:46 ` Phillip Susi 0 siblings, 1 reply; 71+ messages in thread From: Jan Ziak @ 2022-03-08 21:43 UTC (permalink / raw) To: Phillip Susi; +Cc: linux-btrfs On Mon, Mar 7, 2022 at 3:31 PM Phillip Susi <phill@thesusis.net> wrote: > Jan Ziak <0xe2.0x9a.0x9b@gmail.com> writes: > > Manual defragmentation decreased the file's size by 7 GB: > Eh? How does defragging change a file's size? I noticed this inaccurate wording in my email as well, but that was (unfortunately) after I already sent the email. I was hoping that after examining the compsize logs present in the email, readers would understand what the term "file size" means in this particular case. Sincerely Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-08 21:43 ` Jan Ziak @ 2022-03-09 18:46 ` Phillip Susi 2022-03-09 21:35 ` Jan Ziak 0 siblings, 1 reply; 71+ messages in thread From: Phillip Susi @ 2022-03-09 18:46 UTC (permalink / raw) To: Jan Ziak; +Cc: linux-btrfs Jan Ziak <0xe2.0x9a.0x9b@gmail.com> writes: > I noticed this inaccurate wording in my email as well, but that was > (unfortunately) after I already sent the email. I was hoping that > after examining the compsize logs present in the email, readers would > understand what the term "file size" means in this particular case. I don't understand. You stated that the size decreased by 7 GB, and the size figures that followed appeared to bear that out. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-09 18:46 ` Phillip Susi @ 2022-03-09 21:35 ` Jan Ziak 2022-03-14 20:02 ` Phillip Susi 0 siblings, 1 reply; 71+ messages in thread From: Jan Ziak @ 2022-03-09 21:35 UTC (permalink / raw) To: Phillip Susi; +Cc: linux-btrfs On Wed, Mar 9, 2022 at 7:47 PM Phillip Susi <phill@thesusis.net> wrote: > Jan Ziak <0xe2.0x9a.0x9b@gmail.com> writes: > > > I noticed this inaccurate wording in my email as well, but that was > > (unfortunately) after I already sent the email. I was hoping that > > after examining the compsize logs present in the email, readers would > > understand what the term "file size" means in this particular case. > > I don't understand. You stated that the size decreased by 7 GB, and the > size figures that followed appeared to bear that out. The actual disk usage of a file in a copy-on-write filesystem can be much larger than sb.st_size obtained via fstat(fd, &sb) if, for example, a program performs many (millions) single-byte file changes using write(fd, buf, 1) to distinct/random offsets in the file. Before running "btrfs fi de file.sqlite": the disk usage of the file was 47GB, 1834778 extents After running "btrfs fi de file.sqlite; sync": the disk usage of the file was 40GB, 13074 extents Sincerely Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-09 21:35 ` Jan Ziak @ 2022-03-14 20:02 ` Phillip Susi 2022-03-14 21:53 ` Jan Ziak 2022-03-16 16:52 ` Andrei Borzenkov 0 siblings, 2 replies; 71+ messages in thread From: Phillip Susi @ 2022-03-14 20:02 UTC (permalink / raw) To: Jan Ziak; +Cc: linux-btrfs Jan Ziak <0xe2.0x9a.0x9b@gmail.com> writes: > The actual disk usage of a file in a copy-on-write filesystem can be > much larger than sb.st_size obtained via fstat(fd, &sb) if, for > example, a program performs many (millions) single-byte file changes > using write(fd, buf, 1) to distinct/random offsets in the file. How? I mean if you write to part of the file a new block is written somewhere else but the original one is then freed, so the overall size should not change. Just because all of the blocks of the file are not contiguous does not mean that the file has more of them, and making them contiguous does not reduce the number of them. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-14 20:02 ` Phillip Susi @ 2022-03-14 21:53 ` Jan Ziak 2022-03-14 22:24 ` Remi Gauvin 2022-03-15 18:15 ` Phillip Susi 1 sibling, 2 replies; 71+ messages in thread From: Jan Ziak @ 2022-03-14 21:53 UTC (permalink / raw) Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 2479 bytes --] On Mon, Mar 14, 2022 at 9:05 PM Phillip Susi <phill@thesusis.net> wrote: > Jan Ziak <0xe2.0x9a.0x9b@gmail.com> writes: > > > The actual disk usage of a file in a copy-on-write filesystem can be > > much larger than sb.st_size obtained via fstat(fd, &sb) if, for > > example, a program performs many (millions) single-byte file changes > > using write(fd, buf, 1) to distinct/random offsets in the file. > > How? I mean if you write to part of the file a new block is written > somewhere else but the original one is then freed, so the overall size > should not change. Just because all of the blocks of the file are not > contiguous does not mean that the file has more of them, and making them > contiguous does not reduce the number of them. > It is true that it is possible to design a copy-on-write filesystem, albeit with additional costs, that will never waste a single extent even in the case of highly fragmented files. But btrfs isn't such a filesystem. The manpage of /usr/bin/compsize contains the following diagram (use a fixed font when viewing):

           +-------+-------+---------------+
extent A   | used  | waste |     used      |
           +-------+-------+---------------+
extent B           | used  |
                   +-------+

However, what the manpage doesn't mention is that, in the case of btrfs, the above diagram applies not only to compressed extents but to other types of extents as well. You can examine this yourself if you compile compsize-1.5 using "make debug" on your machine and use the Bash script that is attached to this email. The Bash script creates one 10 MiB file. This file has 1 extent of size 10 MiB (assuming the btrfs filesystem has enough non-fragmented free space to create a contiguous extent of size 10 MiB). Then the script writes random 4K blocks at random 4K offsets in the file. Examination of compsize's debug output shows that the whole 10 MiB extent is still stored on the storage device, despite the fact that many of the 4K pages comprising the 10 MiB extent have been overwritten and the file has been synced to the storage device: .... regular: ram_bytes=10485760 compression=0 disk_num_bytes=10485760 .... In this test case, "Disk Usage" is 60% higher than the file's size:

$ compsize data
Processed 1 file, 612 regular extents (1221 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%      16M          16M          10M

-Jan [-- Attachment #2: btrfs-waste.sh --] [-- Type: application/x-shellscript, Size: 557 bytes --] ^ permalink raw reply [flat|nested] 71+ messages in thread
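A rough reconstruction of the kind of test just described (the actual attached btrfs-waste.sh is not reproduced here; the file name, sizes and iteration count are made up for illustration):

dd if=/dev/urandom of=data bs=1M count=10 status=none    # ~10 MiB file, ideally one large extent
sync
for i in $(seq 1 2000); do
    off=$((RANDOM % 2560))                               # random 4 KiB block within the 10 MiB file
    dd if=/dev/urandom of=data bs=4k count=1 seek="$off" conv=notrunc status=none
done
sync
compsize data    # Disk Usage can end up well above Referenced: the original extent stays
                 # fully allocated as long as any block of it is still referenced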
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-14 21:53 ` Jan Ziak @ 2022-03-14 22:24 ` Remi Gauvin 2022-03-14 22:51 ` Zygo Blaxell 0 siblings, 1 reply; 71+ messages in thread From: Remi Gauvin @ 2022-03-14 22:24 UTC (permalink / raw) To: linux-btrfs On 2022-03-14 5:53 p.m., Jan Ziak wrote: > .... > > In this test case, "Disk Usage" is 60% higher than the file's size: > > $ compsize data > Processed 1 file, 612 regular extents (1221 refs), 0 inline. > Type Perc Disk Usage Uncompressed Referenced > TOTAL 100% 16M 16M 10M It would be nice if we could get a mount option to specify maximum extent size, so this effect could be minimized on SSD without having to use compress-force. (Or maybe this should be the default when ssd mode is automatically detected.) ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-14 22:24 ` Remi Gauvin @ 2022-03-14 22:51 ` Zygo Blaxell 2022-03-14 23:07 ` Remi Gauvin 0 siblings, 1 reply; 71+ messages in thread From: Zygo Blaxell @ 2022-03-14 22:51 UTC (permalink / raw) To: Remi Gauvin; +Cc: linux-btrfs On Mon, Mar 14, 2022 at 06:24:43PM -0400, Remi Gauvin wrote: > On 2022-03-14 5:53 p.m., Jan Ziak wrote: > > > > .... > > > > In this test case, "Disk Usage" is 60% higher than the file's size: > > > > $ compsize data > > Processed 1 file, 612 regular extents (1221 refs), 0 inline. > > Type Perc Disk Usage Uncompressed Referenced > > TOTAL 100% 16M 16M 10M > > > It would be nice if we could get a mount option to specify maximum > extent size, so this effect could be minimized on SSD without having to > use compress-force. (Or maybe this should be the default when ssd mode > is automatically detected.) If you never use prealloc or defrag, it's usually not a problem. Files mostly fall into two categories: big sequential writes (where big extents are better) or small random writes (where big extents are bad, but you don't have any of those because you're doing small random writes all the time). Writeback gets this right most of the time, so the extents end up the right sizes on disk. If all your writes are random, short, and aligned to a multiple of 4K, then you'll end up in a steady state with a lot of short extents and little to no wasted space. If you run defrag on that, you end up with half the space wasted, and if the writes continue, lots of small extents either way. Prealloc's bad effects are similar to defrag, but with more reliable losses. A mount option to disable prealloc globally might be very useful--I run a number of apps that think prealloc doesn't waste huge amounts of CPU time and disk space on datacow files, and I grow weary of patching or LD_PRELOAD-hacking them all the time to not call fallocate(). ^ permalink raw reply [flat|nested] 71+ messages in thread
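A quick way to check whether a given application preallocates at all, before reaching for LD_PRELOAD hacks, is to trace its fallocate calls; preallocated ranges also show up as "unwritten" extents in filefrag output (application and file names below are placeholders):

$ strace -f -e trace=fallocate -o /tmp/fallocate.log someapp
$ filefrag -v somefile | grep -c unwritten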
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-14 22:51 ` Zygo Blaxell @ 2022-03-14 23:07 ` Remi Gauvin 2022-03-14 23:39 ` Zygo Blaxell 0 siblings, 1 reply; 71+ messages in thread From: Remi Gauvin @ 2022-03-14 23:07 UTC (permalink / raw) To: linux-btrfs On 2022-03-14 6:51 p.m., Zygo Blaxell wrote: > > If you never use prealloc or defrag, it's usually not a problem. > You're assuming that the file is created from scratch on that media. VM and databases that are restored from images/backups, or re-written as some kind of maintenance (shrink vm images, compress database, or whatever) become a huge problem. In one instance, I had a VM image that was taking up more than 100% of its filesize due to lack of defrag. For a while I was regularly defragmenting those with a target size of 100MB as the only way to garbage collect, but that is a shameful waste of write cycles on SSD. Adding compress-force=lzo was the only way for me to solve this issue (and it even seems to help performance (on SSD, *not* HDD), though probably not for small random reads; I haven't properly compared that.) ^ permalink raw reply [flat|nested] 71+ messages in thread
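The workaround described above corresponds roughly to the following commands (paths are placeholders; note that defragmenting rewrites data and can unshare reflinked or snapshotted copies):

# One-off: rewrite the image with a 100M target extent size to release space
# pinned by partially overwritten extents.
$ btrfs filesystem defragment -t 100M /srv/vm/image.raw
# Ongoing: force lzo compression so new extents stay small (at most 128 KiB).
$ mount -o remount,compress-force=lzo /srv/vm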
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-14 23:07 ` Remi Gauvin @ 2022-03-14 23:39 ` Zygo Blaxell 2022-03-15 14:14 ` Remi Gauvin 0 siblings, 1 reply; 71+ messages in thread From: Zygo Blaxell @ 2022-03-14 23:39 UTC (permalink / raw) To: Remi Gauvin; +Cc: linux-btrfs On Mon, Mar 14, 2022 at 07:07:44PM -0400, Remi Gauvin wrote: > On 2022-03-14 6:51 p.m., Zygo Blaxell wrote: > > If you never use prealloc or defrag, it's usually not a problem. > > You're assuming that the file is created from scratch on that media. VM > and databases that are restored from images/backups, or re-written as > some kind of maintenance (shrink vm images, compress database, or > whatever) become a huge problem. VM images do sometimes combine sequential and random writes and create a lot of waste. They are one of the outliers that is a problem case even with a normal life cycle (as opposed to one interrupted by backup restore). A cluster command in SQL can instantly double a DB size. > In one instance, I had a VM image that was taking up more than 100% of > its filesize due to lack of defrag. For a while I was regularly > defragmenting those with a target size of 100MB as the only way to garbage > collect, but that is a shameful waste of write cycles on SSD. Adding > compress-force=lzo was the only way for me to solve this issue (and it > even seems to help performance (on SSD, *not* HDD), though probably not > for small random reads; I haven't properly compared that.) Ideally we'd have a proper garbage collection tool for btrfs that ran defrag _only_ on extents that are holding references to wasted space, which is the side-effect of defrag that most people want instead of what defrag nominally tries to do. I have it on my already too-long to-do list. If we're adding a mount option for this (I'm not opposed to it, I'm pointing out that it's not the first tool to reach for), then ideally we'd overload it for the compressed batch size (currently hardcoded at 512K). I have IO patterns that would like compress-force to write 128M uncompressed extents, and provide enough extents at a time to keep all the cores busy sequentially compressing a single extent on each one. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-14 23:39 ` Zygo Blaxell @ 2022-03-15 14:14 ` Remi Gauvin 2022-03-15 18:51 ` Zygo Blaxell 0 siblings, 1 reply; 71+ messages in thread From: Remi Gauvin @ 2022-03-15 14:14 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs On 2022-03-14 7:39 p.m., Zygo Blaxell wrote: > > If we're adding a mount option for this (I'm not opposed to it, I'm > pointing out that it's not the first tool to reach for), then ideally > we'd overload it for the compressed batch size (currently hardcoded > at 512K). Are there any advantages to extents larger than 256K on ssd Media? Even if a much needed garbage collection process were to be created, the smaller extents would mean less data would need to be re-written, (and potentially duplicated due to snapshots and ref copies.) The fine details on how to implement all of this is way over my head, but it seemed to me like the logic to keep the extents small is already more or less already there, and would need relatively very little work to manifest. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-15 14:14 ` Remi Gauvin @ 2022-03-15 18:51 ` Zygo Blaxell 2022-03-15 19:22 ` Remi Gauvin 0 siblings, 1 reply; 71+ messages in thread From: Zygo Blaxell @ 2022-03-15 18:51 UTC (permalink / raw) To: Remi Gauvin; +Cc: linux-btrfs On Tue, Mar 15, 2022 at 10:14:01AM -0400, Remi Gauvin wrote: > On 2022-03-14 7:39 p.m., Zygo Blaxell wrote: > > If we're adding a mount option for this (I'm not opposed to it, I'm > > pointing out that it's not the first tool to reach for), then ideally > > we'd overload it for the compressed batch size (currently hardcoded > > at 512K). > > Are there any advantages to extents larger than 256K on ssd Media? The main advantage of larger extents is smaller metadata, and it doesn't matter very much whether it's SSD or HDD. Adjacent extents will be in the same metadata page, so not much is lost with 256K extents even on HDD, as long as they are physically allocated adjacent to each other. There is a CPU hit for every extent, and when snapshot pages become unshared, every distinct extent on the page needs its reference count updated for the new page. The costs of small extents add up during balances, resizes, and snapshot deletes, but on a small filesystem you'd want smaller extents so that balances and resizes are possible at all (this is why there's a 128M limit now--previously, extents of multiple GB were possible). Averaged across my filesystems, half of the data blocks are in extents below 512K, and only 1% of extents are 1M or larger. Capping the extent size at 256K wouldn't make much difference--the total extent count would increase by less than 5%. In my defrag experiments, the pareto limit kicks in at a target extent size of 100K-200K (anything larger than this doesn't get better when defragged, anything smaller kills performance if it's _not_ defragged). 256K may already be larger than optimal for some workloads. > Even if a much needed garbage collection process were to be created, > the smaller extents would mean less data would need to be re-written, > (and potentially duplicated due to snapshots and ref copies.) GC has to take all references into account when computing block reachability, and it has to eliminate all references to remove garbage, so there should not be any new duplicate data. Currently GC has to be implemented by copying the data and then using dedupe to replace references to the original data individually, but that could be optimized with a new kernel ioctl that handles all the references at once with a lock, instead of comparing the data bytes for each one. GC could create smaller extents intentionally, by creating new extents in units of 256K, but reflinking them in reverse order over the original large extents to prevent coalescing extents in writeback. GC would also have to figure out whether the IO cost of splitting the extent is worth the space saving (e.g. don't relocate 100MB of data to save 4K of disk space, wait until it's at least 1MB of space saved). That's a sysadmin policy input. GC is not autodefrag. If it sees that it has to carve up 100M extents for sub-64K writes, GC can create 400x 256K extents to replace the large extents, and only defrag when there's a contiguous range of modified extents with length 64K or less. Or whatever sizes turn out to be the right ones--setting the sizes isn't the hard thing to do here. 
Obviously, in that scenario it is more efficient if there's a way to not write the 100M extents in the first place, but it quickly reaches a steady state with relatively little wasted space, and doesn't require tuning knobs in the kernel. GC + autodefrag could go the other way, too: make the default extent size small, but allow autodefrag to request very large extents for files that have not been modified in a while. That's inefficient too, but in other other direction, so it would be a better match for the steady state of some workloads (e.g. video recording or log files). Ideally there'd be an "optimum extent size" inheritable inode property, so we can have databases use tiny extents and video recorders use huge extents on the same filesystem. But maybe that's overengineering, and 256K (128K? 512K?) is within the range of values for most. > The fine details on how to implement all of this is way over my head, > but it seemed to me like the logic to keep the extents small is already > more or less already there, and would need relatively very little work > to manifest. There's a #define for maximum new extent length. It wouldn't be too difficult to look up that number in fs_info instead, slightly harder to look it up in an inode. The limit applies only to new extents, so there's no backward compatibility issue with the on-disk format. ^ permalink raw reply [flat|nested] 71+ messages in thread
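For reference, a rough way to look at the extent-size distribution of a single file, assuming filefrag -v's usual column layout (extent length in the sixth column, counted in 4 KiB filesystem blocks):

$ filefrag -v file.sqlite \
    | awk '/^ *[0-9]+:/ { gsub(":", "", $6); print $6 * 4 }' \
    | sort -n | uniq -c | tail
# prints "count  extent-size-in-KiB" pairs, largest sizes last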
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-15 18:51 ` Zygo Blaxell @ 2022-03-15 19:22 ` Remi Gauvin 2022-03-15 21:08 ` Zygo Blaxell 0 siblings, 1 reply; 71+ messages in thread From: Remi Gauvin @ 2022-03-15 19:22 UTC (permalink / raw) To: Zygo Blaxell, linux-btrfs On 2022-03-15 2:51 p.m., Zygo Blaxell wrote: > The main advantage of larger extents is smaller metadata, and it doesn't > matter very much whether it's SSD or HDD. Adjacent extents will be in > the same metadata page, so not much is lost with 256K extents even on HDD, > as long as they are physically allocated adjacent to each other. > When I tried enabling compress-force on my HDD storage, it *killed* sequential read performance. I could write a file out at over 100MB/s... but trying to read that same file sequentially would thrash the drives, with less than 5MB/s actually being read. No such problems were observed on SSD storage. I was under the impression this problem was caused by trying to read files with the 127k extents, which, for whatever reason, could not be done without excessive seeking. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-15 19:22 ` Remi Gauvin @ 2022-03-15 21:08 ` Zygo Blaxell 0 siblings, 0 replies; 71+ messages in thread From: Zygo Blaxell @ 2022-03-15 21:08 UTC (permalink / raw) To: Remi Gauvin; +Cc: linux-btrfs On Tue, Mar 15, 2022 at 03:22:43PM -0400, Remi Gauvin wrote: > On 2022-03-15 2:51 p.m., Zygo Blaxell wrote: > > > The main advantage of larger extents is smaller metadata, and it doesn't > > matter very much whether it's SSD or HDD. Adjacent extents will be in > > the same metadata page, so not much is lost with 256K extents even on HDD, > > as long as they are physically allocated adjacent to each other. > > > > When I tried enabling compress-force on my HDD storage, it *killed* > sequential read performance. I could write a file out at over > 100MB/s... but trying to read that same file sequentially would thrash > the drives, with less than 5MB/s actually being read. > > > No such problems were observed on SSD storage. I've seen a similar effect. I wonder if the small extents are breaking readahead or something. > I was under the impression this problem was caused by trying to read files > with the 127k extents, which, for whatever reason, could not be done > without excessive seeking. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-14 21:53 ` Jan Ziak 2022-03-14 22:24 ` Remi Gauvin @ 2022-03-15 18:15 ` Phillip Susi 1 sibling, 0 replies; 71+ messages in thread From: Phillip Susi @ 2022-03-15 18:15 UTC (permalink / raw) To: Jan Ziak; +Cc: unlisted-recipients, linux-btrfs Jan Ziak <0xe2.0x9a.0x9b@gmail.com> writes: > However, what the manpage doesn't mention is that, in the case of > btrfs, the above diagram applies not only to compressed extents but to > other types of extents as well. Ok, so if you are using compression then your choices are either to read the entire 128k compressed block, decompress it, update the 4k, recompress it, and write the whole thing back... or just write the modified 4k elsewhere and now there is some wasted space in the compressed block. But why would something like this happen without compression? ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-14 20:02 ` Phillip Susi 2022-03-14 21:53 ` Jan Ziak @ 2022-03-16 16:52 ` Andrei Borzenkov 2022-03-16 18:28 ` Jan Ziak 2022-03-16 18:31 ` Phillip Susi 1 sibling, 2 replies; 71+ messages in thread From: Andrei Borzenkov @ 2022-03-16 16:52 UTC (permalink / raw) To: Phillip Susi, Jan Ziak; +Cc: linux-btrfs On 14.03.2022 23:02, Phillip Susi wrote: > > Jan Ziak <0xe2.0x9a.0x9b@gmail.com> writes: > >> The actual disk usage of a file in a copy-on-write filesystem can be >> much larger than sb.st_size obtained via fstat(fd, &sb) if, for >> example, a program performs many (millions) single-byte file changes >> using write(fd, buf, 1) to distinct/random offsets in the file. > > How? I mean if you write to part of the file a new block is written > somewhere else but the original one is then freed, so the overall size > should not change. Just because all of the blocks of the file are not > contiguous does not mean that the file has more of them, and making them > contiguous does not reduce the number of them. > btrfs does not manage space in fixed size blocks. You describe behavior of WAFL. btrfs manages space in variable size extents. If you change 999 bytes in 1000 bytes extent, original extent remains allocated because 1 byte is still referenced. So actual space consumption is now 1999 bytes. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-16 16:52 ` Andrei Borzenkov @ 2022-03-16 18:28 ` Jan Ziak 2022-03-16 18:31 ` Phillip Susi 1 sibling, 0 replies; 71+ messages in thread From: Jan Ziak @ 2022-03-16 18:28 UTC (permalink / raw) To: Andrei Borzenkov; +Cc: Phillip Susi, linux-btrfs On Wed, Mar 16, 2022 at 5:52 PM Andrei Borzenkov <arvidjaar@gmail.com> wrote: > On 14.03.2022 23:02, Phillip Susi wrote: > > > > Jan Ziak <0xe2.0x9a.0x9b@gmail.com> writes: > > > >> The actual disk usage of a file in a copy-on-write filesystem can be > >> much larger than sb.st_size obtained via fstat(fd, &sb) if, for > >> example, a program performs many (millions) single-byte file changes > >> using write(fd, buf, 1) to distinct/random offsets in the file. > > > > How? I mean if you write to part of the file a new block is written > > somewhere else but the original one is then freed, so the overall size > > should not change. Just because all of the blocks of the file are not > > contiguous does not mean that the file has more of them, and making them > > contiguous does not reduce the number of them. > > > > btrfs does not manage space in fixed size blocks. You describe behavior > of WAFL. The "single-byte file changes using write(fd, buf, 1)" was just an example for the purpose of the discussion. The example isn't related to the 40GB sqlite file. -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-16 16:52 ` Andrei Borzenkov 2022-03-16 18:28 ` Jan Ziak @ 2022-03-16 18:31 ` Phillip Susi 2022-03-16 18:43 ` Andrei Borzenkov ` (2 more replies) 1 sibling, 3 replies; 71+ messages in thread From: Phillip Susi @ 2022-03-16 18:31 UTC (permalink / raw) To: Andrei Borzenkov; +Cc: Jan Ziak, linux-btrfs Andrei Borzenkov <arvidjaar@gmail.com> writes: > btrfs manages space in variable size extents. If you change 999 bytes in > 1000 bytes extent, original extent remains allocated because 1 byte is > still referenced. So actual space consumption is now 1999 bytes. Huh? You can't really do that because the page cache manages files in 4k pages. If you have a 1M extent and you touch one byte in the file, thus making one 4k page dirty, are you really saying that btrfs will write that modified 4k page somewhere else and NOT free the 4k block that is no longer used? Why the heck not? ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-16 18:31 ` Phillip Susi @ 2022-03-16 18:43 ` Andrei Borzenkov 2022-03-16 18:46 ` Jan Ziak 2022-03-16 19:04 ` Zygo Blaxell 2 siblings, 0 replies; 71+ messages in thread From: Andrei Borzenkov @ 2022-03-16 18:43 UTC (permalink / raw) To: Phillip Susi; +Cc: Jan Ziak, linux-btrfs On 16.03.2022 21:31, Phillip Susi wrote: > > Andrei Borzenkov <arvidjaar@gmail.com> writes: > >> btrfs manages space in variable size extents. If you change 999 bytes in >> 1000 bytes extent, original extent remains allocated because 1 byte is >> still referenced. So actual space consumption is now 1999 bytes. > > Huh? You can't really do that because the page cache manages files in > 4k pages. If you have a 1M extent and you touch one byte in the file, > thus making one 4k page dirty, are you really saying that btrfs will > write that modified 4k page somewhere else and NOT free the 4k block > that is no longer used? yes. > Why the heck not? > Short answer - because it is implemented this way. There could be an arbitrary number of overlapping references to this extent. Tracking all of these references on every write, to decide whether an extent can be split, would likely be prohibitively inefficient. The alternative is to use fixed-size blocks, where freeing space is just a matter of reference counting. This means increasing the size of the metadata needed to keep track of blocks, with unknown impact. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-16 18:31 ` Phillip Susi 2022-03-16 18:43 ` Andrei Borzenkov @ 2022-03-16 18:46 ` Jan Ziak 2022-03-16 19:04 ` Zygo Blaxell 2 siblings, 0 replies; 71+ messages in thread From: Jan Ziak @ 2022-03-16 18:46 UTC (permalink / raw) To: Phillip Susi; +Cc: Andrei Borzenkov, linux-btrfs On Wed, Mar 16, 2022 at 7:35 PM Phillip Susi <phill@thesusis.net> wrote: > Andrei Borzenkov <arvidjaar@gmail.com> writes: > > > btrfs manages space in variable size extents. If you change 999 bytes in > > 1000 bytes extent, original extent remains allocated because 1 byte is > > still referenced. So actual space consumption is now 1999 bytes. > > Huh? You can't really do that because the page cache manages files in > 4k pages. If you have a 1M extent and you touch one byte in the file, > thus making one 4k page dirty, are you really saying that btrfs will > write that modified 4k page somewhere else and NOT free the 4k block > that is no longer used? The questions "Why ... will it write that modified 4k page somewhere else?" and "Why ... not free the 4k block that is no longer used?" are two separate questions. In any CoW filesystem, the answer to the 1st question is: because it is a CoW filesystem. Because it is a basic assumption/premise of the filesystem's design. The answer to the 2nd question depends on whether the CoW filesystem is well-optimized to handle such a scenario or not optimized to handle such a scenario. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-16 18:31 ` Phillip Susi 2022-03-16 18:43 ` Andrei Borzenkov 2022-03-16 18:46 ` Jan Ziak @ 2022-03-16 19:04 ` Zygo Blaxell 2022-03-17 20:34 ` Phillip Susi 2 siblings, 1 reply; 71+ messages in thread From: Zygo Blaxell @ 2022-03-16 19:04 UTC (permalink / raw) To: Phillip Susi; +Cc: Andrei Borzenkov, Jan Ziak, linux-btrfs On Wed, Mar 16, 2022 at 02:31:34PM -0400, Phillip Susi wrote: > > Andrei Borzenkov <arvidjaar@gmail.com> writes: > > > btrfs manages space in variable size extents. If you change 999 bytes in > > 1000 bytes extent, original extent remains allocated because 1 byte is > > still referenced. So actual space consumption is now 1999 bytes. > > Huh? You can't really do that because the page cache manages files in > 4k pages. You can get a 1-byte file reference if you make a reflink of the last block of a 4097-byte file, or punch a hole in the first 4096 bytes of a 4097-byte file. This creates a file containing only a reference to the last byte of the original extent. In theory you could create a 4098-byte file, then make reflinks from that file to two other files covering the last 1 and last 2 bytes; however, that's disallowed in the kernel to make sure that assorted dedupe data leak shenanigans with shared reflinks that don't all end at the same byte in the page can't ever happen. ^ permalink raw reply [flat|nested] 71+ messages in thread
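A sketch of the hole-punch variant described above (file name is illustrative; exact compsize numbers will vary):

# Create a 4097-byte file, then punch out the first 4096 bytes. The file now
# references only the final byte, but the original extent stays allocated.
$ head -c 4097 /dev/urandom > tailbyte
$ sync
$ fallocate --punch-hole --offset 0 --length 4096 tailbyte
$ sync
$ compsize tailbyte    # referenced bytes drop, disk usage of the extent does not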
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-16 19:04 ` Zygo Blaxell @ 2022-03-17 20:34 ` Phillip Susi 2022-03-17 22:06 ` Zygo Blaxell 0 siblings, 1 reply; 71+ messages in thread From: Phillip Susi @ 2022-03-17 20:34 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Andrei Borzenkov, Jan Ziak, linux-btrfs Zygo Blaxell <ce3g8jdj@umail.furryterror.org> writes: > You can get a 1-byte file reference if you make a reflink of the last > block of a 4097-byte file, or punch a hole in the first 4096 bytes of a > 4097-byte file. This creates a file containing only a reference to the > last byte of the original extent. So the inode only refers to one byte of the extent, but the extent is still always a multiple of 4k right? ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-17 20:34 ` Phillip Susi @ 2022-03-17 22:06 ` Zygo Blaxell 0 siblings, 0 replies; 71+ messages in thread From: Zygo Blaxell @ 2022-03-17 22:06 UTC (permalink / raw) To: Phillip Susi; +Cc: Andrei Borzenkov, Jan Ziak, linux-btrfs On Thu, Mar 17, 2022 at 04:34:51PM -0400, Phillip Susi wrote: > Zygo Blaxell <ce3g8jdj@umail.furryterror.org> writes: > > You can get a 1-byte file reference if you make a reflink of the last > > block of a 4097-byte file, or punch a hole in the first 4096 bytes of a > > 4097-byte file. This creates a file containing only a reference to the > > last byte of the original extent. > > So the inode only refers to one byte of the extent, but the extent is > still always a multiple of 4k right? Yes. In theory, the on-disk format specifies extent locations and sizes in bytes. In practice, the kernel enforces that all the extent physical boundaries be a multiple of the CPU page size (or multiples of _some_ CPU's page size, with the subpage patches). On read, anything with a logical length that isn't a multiple of 4K is zero-filled. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-06 15:59 Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit Jan Ziak 2022-03-07 0:48 ` Qu Wenruo 2022-03-07 14:30 ` Phillip Susi @ 2022-03-16 12:47 ` Kai Krakow 2022-03-16 18:18 ` Jan Ziak 2 siblings, 1 reply; 71+ messages in thread From: Kai Krakow @ 2022-03-16 12:47 UTC (permalink / raw) To: Jan Ziak; +Cc: linux-btrfs Hello! Am So., 6. März 2022 um 18:57 Uhr schrieb Jan Ziak <0xe2.0x9a.0x9b@gmail.com>: > > I would like to report that btrfs in Linux kernel 5.16.12 mounted with > the autodefrag option wrote 5TB in a single day to a 1TB SSD that is > about 50% full. > > Defragmenting 0.5TB on a drive that is 50% full should write far less than 5TB. > > Benefits to the fragmentation of the most written files over the > course of the one day (sqlite database files) are nil. Please see the > data below. Also note that the sqlite file is using up to 10 GB more > than it should due to fragmentation. > > CPU utilization on an otherwise idle machine is approximately 600% all > the time: btrfs-cleaner 100%, kworkers...btrfs 500%. > > I am not just asking you to fix this issue - I am asking you how is it > possible for an algorithm that is significantly worse than O(N*log(N)) > to be merged into the Linux kernel in the first place!? > > Please try to avoid discussing no-CoW (chattr +C) in your response, > because it is beside the point. Thanks. Yeah, that's one solution. But you could also try disabling double-write by turning on WAL-mode in sqlite: Use the cmdline tool to connect to the database file, then run "PRAGMA journal_mode=WAL;". This can only be switched when only one client is connected, so you need to temporarily suspend processes using the database. (https://dev.to/lefebvre/speed-up-sqlite-with-write-ahead-logging-wal-do) It may be worth disabling compression: "chattr +m DIRECTORY-OF-SQLITE-DB", but this can only be switched for newly created files, so you'd need to rename and copy the existing sqlite files. This reduces the number of extents created. Enabling WAL disables the rollback journal and prevents smallish in-place updates of data blocks in the database file. Instead, it uses checkpointing to update the database safely in bigger chunks, thus using write patterns better suited to CoW filesystems. HTH Kai ^ permalink raw reply [flat|nested] 71+ messages in thread
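For reference, switching an existing database to WAL mode from the shell might look like this (path is a placeholder; the setting is persistent, but it can only be changed while no other connection has the database open):

$ sqlite3 /path/to/file.sqlite 'PRAGMA journal_mode=WAL;'   # prints "wal" on success
$ sqlite3 /path/to/file.sqlite 'PRAGMA journal_mode;'       # verify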
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-16 12:47 ` Kai Krakow @ 2022-03-16 18:18 ` Jan Ziak 0 siblings, 0 replies; 71+ messages in thread From: Jan Ziak @ 2022-03-16 18:18 UTC (permalink / raw) To: Kai Krakow; +Cc: linux-btrfs On Wed, Mar 16, 2022 at 1:48 PM Kai Krakow <hurikhan77+btrfs@gmail.com> wrote: > Am So., 6. März 2022 um 18:57 Uhr schrieb Jan Ziak <0xe2.0x9a.0x9b@gmail.com>: > > Please try to avoid discussing no-CoW (chattr +C) in your response, > > because it is beside the point. Thanks. > > Yeah, that's one solution. But you could also try disabling > double-write by turning on WAL-mode in sqlite: As far as I can tell, the app is using journal_mode=wal for all database connections using code such as: c = await aiosqlite.connect(db_path) await c.execute("pragma journal_mode=wal") There are sqlite-wal files in all database directories. Compression is disabled. According to Bash history, I executed "btrfs filesystem defragment -r" in the 41 GB sqlite directory about 2 days ago. The current number of extents (after 2 days) is: $ compsize file.sqlite Processed 1 file, 1438640 regular extents (2593549 refs), 0 inline. Type Perc Disk Usage Uncompressed Referenced TOTAL 100% 50G 50G 41G none 100% 50G 50G 41G -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit @ 2022-06-17 0:20 Jan Ziak 0 siblings, 0 replies; 71+ messages in thread From: Jan Ziak @ 2022-06-17 0:20 UTC (permalink / raw) To: linux-btrfs This is an update to the previously reported btrfs fragmentation issues. Defragmenting a 79 GiB file increased the number of bytes allocated to the file in a btrfs filesystem from 118 GiB to 161 GiB: linux 5.17.5 btrfs-progs 5.18.1 $ compsize file.sqlite Type Perc Disk Usage Uncompressed Referenced TOTAL 99% 117G 118G 78G none 100% 116G 116G 77G zstd 30% 471M 1.5G 1.2G $ btrfs filesystem defragment -t 256K file.sqlite $ compsize file.sqlite Type Perc Disk Usage Uncompressed Referenced TOTAL 99% 160G 161G 78G none 100% 159G 159G 77G zstd 28% 405M 1.3G 1.3G $ dd if=file.sqlite of=file-1.sqlite bs=1M status=progress 84659167232 bytes (85 GB, 79 GiB) copied, 122.376 s, 692 MB/s $ compsize file-1.sqlite Type Perc Disk Usage Uncompressed Referenced TOTAL 98% 77G 78G 78G none 100% 77G 77G 77G zstd 28% 361M 1.2G 1.2G ^ permalink raw reply [flat|nested] 71+ messages in thread
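Pending a real garbage-collection tool, the copy shown above can be turned into a manual compaction step (paths are illustrative; the application must be stopped, and any sqlite -wal file checkpointed, before the files are swapped):

$ cp --reflink=never file.sqlite file.sqlite.compact
$ sync
$ compsize file.sqlite.compact    # disk usage should now match the referenced size
$ mv file.sqlite.compact file.sqlite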
end of thread, other threads:[~2022-06-17 0:21 UTC | newest] Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-03-06 15:59 Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit Jan Ziak 2022-03-07 0:48 ` Qu Wenruo 2022-03-07 2:23 ` Jan Ziak 2022-03-07 2:39 ` Qu Wenruo 2022-03-07 7:31 ` Qu Wenruo 2022-03-10 1:10 ` Jan Ziak 2022-03-10 1:26 ` Qu Wenruo 2022-03-10 4:33 ` Jan Ziak 2022-03-10 6:42 ` Qu Wenruo 2022-03-10 21:31 ` Jan Ziak 2022-03-10 23:27 ` Qu Wenruo 2022-03-11 2:42 ` Jan Ziak 2022-03-11 2:59 ` Qu Wenruo 2022-03-11 5:04 ` Jan Ziak 2022-03-11 16:31 ` Jan Ziak 2022-03-11 20:02 ` Jan Ziak 2022-03-11 23:04 ` Qu Wenruo 2022-03-11 23:28 ` Jan Ziak 2022-03-11 23:39 ` Qu Wenruo 2022-03-12 0:01 ` Jan Ziak 2022-03-12 0:15 ` Qu Wenruo 2022-03-12 3:16 ` Zygo Blaxell 2022-03-12 2:43 ` Zygo Blaxell 2022-03-12 3:24 ` Qu Wenruo 2022-03-12 3:48 ` Zygo Blaxell 2022-03-14 20:09 ` Phillip Susi 2022-03-14 22:59 ` Zygo Blaxell 2022-03-15 18:28 ` Phillip Susi 2022-03-15 19:28 ` Jan Ziak 2022-03-15 21:06 ` Zygo Blaxell 2022-03-15 22:20 ` Jan Ziak 2022-03-16 17:02 ` Zygo Blaxell 2022-03-16 17:48 ` Jan Ziak 2022-03-17 2:11 ` Zygo Blaxell 2022-03-16 18:46 ` Phillip Susi 2022-03-16 19:59 ` Zygo Blaxell 2022-03-20 17:50 ` Forza 2022-03-20 21:15 ` Zygo Blaxell 2022-03-08 21:57 ` Jan Ziak 2022-03-08 23:40 ` Qu Wenruo 2022-03-09 22:22 ` Jan Ziak 2022-03-09 22:44 ` Qu Wenruo 2022-03-09 22:55 ` Jan Ziak 2022-03-09 23:00 ` Jan Ziak 2022-03-09 4:48 ` Zygo Blaxell 2022-03-07 14:30 ` Phillip Susi 2022-03-08 21:43 ` Jan Ziak 2022-03-09 18:46 ` Phillip Susi 2022-03-09 21:35 ` Jan Ziak 2022-03-14 20:02 ` Phillip Susi 2022-03-14 21:53 ` Jan Ziak 2022-03-14 22:24 ` Remi Gauvin 2022-03-14 22:51 ` Zygo Blaxell 2022-03-14 23:07 ` Remi Gauvin 2022-03-14 23:39 ` Zygo Blaxell 2022-03-15 14:14 ` Remi Gauvin 2022-03-15 18:51 ` Zygo Blaxell 2022-03-15 19:22 ` Remi Gauvin 2022-03-15 21:08 ` Zygo Blaxell 2022-03-15 18:15 ` Phillip Susi 2022-03-16 16:52 ` Andrei Borzenkov 2022-03-16 18:28 ` Jan Ziak 2022-03-16 18:31 ` Phillip Susi 2022-03-16 18:43 ` Andrei Borzenkov 2022-03-16 18:46 ` Jan Ziak 2022-03-16 19:04 ` Zygo Blaxell 2022-03-17 20:34 ` Phillip Susi 2022-03-17 22:06 ` Zygo Blaxell 2022-03-16 12:47 ` Kai Krakow 2022-03-16 18:18 ` Jan Ziak 2022-06-17 0:20 Jan Ziak