* Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit
@ 2022-03-06 15:59 Jan Ziak
  2022-03-07  0:48 ` Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 71+ messages in thread
From: Jan Ziak @ 2022-03-06 15:59 UTC (permalink / raw)
  To: linux-btrfs

I would like to report that btrfs in Linux kernel 5.16.12, mounted with
the autodefrag option, wrote 5TB in a single day to a 1TB SSD that is
about 50% full. Defragmenting 0.5TB on a drive that is 50% full should
write far less than 5TB.

The benefit to the fragmentation of the most-written files over the
course of that day (sqlite database files) is nil. Please see the data
below. Also note that the sqlite file is using up to 10 GB more space
than it should due to fragmentation.

CPU utilization on an otherwise idle machine is approximately 600% all
the time: btrfs-cleaner 100%, kworkers...btrfs 500%.

I am not just asking you to fix this issue - I am asking you how it is
possible for an algorithm that is significantly worse than O(N*log(N))
to be merged into the Linux kernel in the first place!?

Please try to avoid discussing no-CoW (chattr +C) in your response,
because it is beside the point. Thanks.

----

A day before:

$ smartctl -a /dev/nvme0n1 | grep Units
Data Units Read:    449,265,485 [230 TB]
Data Units Written: 406,386,721 [208 TB]

$ compsize file.sqlite
Processed 1 file, 1757129 regular extents (2934077 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%      46G          46G          37G
none       100%      46G          46G          37G

----

A day after:

$ smartctl -a /dev/nvme0n1 | grep Units
Data Units Read:    473,211,419 [242 TB]
Data Units Written: 417,249,915 [213 TB]

$ compsize file.sqlite
Processed 1 file, 1834778 regular extents (3050838 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%      47G          47G          37G
none       100%      47G          47G          37G

$ filefrag file.sqlite
(Ctrl-C after waiting more than 10 minutes, consuming 100% CPU)

----

Manual defragmentation decreased the file's on-disk size by 7 GB:

$ btrfs-defrag file.sqlite
$ sync
$ compsize file.sqlite
Processed 6 files, 13074 regular extents (20260 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%      40G          40G          37G
none       100%      40G          40G          37G

----

Sincerely
Jan

^ permalink raw reply	[flat|nested] 71+ messages in thread
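
[Aside] smartctl reports NVMe "Data Units" in units of 1000 512-byte blocks
(512,000 bytes each), so the one-day delta quoted above can be converted to
bytes directly; a rough shell check using the figures from the report (not
part of the original message):

    $ echo "$(( (417249915 - 406386721) * 512000 / 1000**3 )) GB written in one day"
    5561 GB written in one day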
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-06 15:59 Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit Jan Ziak @ 2022-03-07 0:48 ` Qu Wenruo 2022-03-07 2:23 ` Jan Ziak 2022-03-07 14:30 ` Phillip Susi 2022-03-16 12:47 ` Kai Krakow 2 siblings, 1 reply; 71+ messages in thread From: Qu Wenruo @ 2022-03-07 0:48 UTC (permalink / raw) To: Jan Ziak, linux-btrfs On 2022/3/6 23:59, Jan Ziak wrote: > I would like to report that btrfs in Linux kernel 5.16.12 mounted with > the autodefrag option wrote 5TB in a single day to a 1TB SSD that is > about 50% full. > > Defragmenting 0.5TB on a drive that is 50% full should write far less than 5TB. If using defrag ioctl, that's a good and solid expectation. > > Benefits to the fragmentation of the most written files over the > course of the one day (sqlite database files) are nil. Please see the > data below. Also note that the sqlite file is using up to 10 GB more > than it should due to fragmentation. Autodefrag will mark any file which got smaller writes (<64K) for scan. For smaller extents than 64K, they will be re-dirtied for writeback. So in theory, if the cleaner is triggered very frequently to do autodefrag, it can indeed easily amplify the writes. Are you using commit= mount option? Which would reduce the commit interval thus trigger autodefrag more frequently. > > CPU utilization on an otherwise idle machine is approximately 600% all > the time: btrfs-cleaner 100%, kworkers...btrfs 500%. The problem is why the CPU usage is at 100% for cleaner. Would you please apply this patch on your kernel? https://patchwork.kernel.org/project/linux-btrfs/patch/bf2635d213e0c85251c4cd0391d8fbf274d7d637.1645705266.git.wqu@suse.com/ Then enable the following trace events: btrfs:defrag_one_locked_range btrfs:defrag_add_target btrfs:defrag_file_start btrfs:defrag_file_end Those trace events would show why we're doing the same re-dirty again and again, and mostly why the CPU usage is so high. Thanks, Qu > > I am not just asking you to fix this issue - I am asking you how is it > possible for an algorithm that is significantly worse than O(N*log(N)) > to be merged into the Linux kernel in the first place!? > > Please try to avoid discussing no-CoW (chattr +C) in your response, > because it is beside the point. Thanks. > > ---- > > A day before: > > $ smartctl -a /dev/nvme0n1 | grep Units > Data Units Read: 449,265,485 [230 TB] > Data Units Written: 406,386,721 [208 TB] > > $ compsize file.sqlite > Processed 1 file, 1757129 regular extents (2934077 refs), 0 inline. > Type Perc Disk Usage Uncompressed Referenced > TOTAL 100% 46G 46G 37G > none 100% 46G 46G 37G > > ---- > > A day after: > > $ smartctl -a /dev/nvme0n1 | grep Units > Data Units Read: 473,211,419 [242 TB] > Data Units Written: 417,249,915 [213 TB] > > $ compsize file.sqlite > Processed 1 file, 1834778 regular extents (3050838 refs), 0 inline. > Type Perc Disk Usage Uncompressed Referenced > TOTAL 100% 47G 47G 37G > none 100% 47G 47G 37G > > $ filefrag file.sqlite > (Ctrl-C after waiting more than 10 minutes, consuming 100% CPU) > > ---- > > Manual defragmentation decreased the file's size by 7 GB: > > $ btrfs-defrag file.sqlite > $ sync > $ compsize file.sqlite > Processed 6 files, 13074 regular extents (20260 refs), 0 inline. > Type Perc Disk Usage Uncompressed Referenced > TOTAL 100% 40G 40G 37G > none 100% 40G 40G 37G > > ---- > > Sincerely > Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
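
[Aside] A quick way to confirm which of these mount options are actually in
effect is to inspect the mounted filesystem directly (a minimal check,
assuming the filesystem in question is mounted at /; not part of the
original exchange):

    $ findmnt -no OPTIONS / | tr ',' '\n' | grep -E '^(commit=|autodefrag)'
    # expect only "autodefrag" here if no commit= override is set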
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-07 0:48 ` Qu Wenruo @ 2022-03-07 2:23 ` Jan Ziak 2022-03-07 2:39 ` Qu Wenruo 0 siblings, 1 reply; 71+ messages in thread From: Jan Ziak @ 2022-03-07 2:23 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On Mon, Mar 7, 2022 at 1:48 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > On 2022/3/6 23:59, Jan Ziak wrote: > > I would like to report that btrfs in Linux kernel 5.16.12 mounted with > > the autodefrag option wrote 5TB in a single day to a 1TB SSD that is > > about 50% full. > > > > Defragmenting 0.5TB on a drive that is 50% full should write far less than 5TB. > > If using defrag ioctl, that's a good and solid expectation. > > Autodefrag will mark any file which got smaller writes (<64K) for scan. > For smaller extents than 64K, they will be re-dirtied for writeback. The NVMe device has 512-byte sectors, but has another namespace with 4K sectors. Will it help btrfs-autodefrag to reformat the drive to 4K sectors? I expect that it won't help - I am asking just in case my expectation is wrong. > So in theory, if the cleaner is triggered very frequently to do > autodefrag, it can indeed easily amplify the writes. According to usr/bin/glances, the sqlite app is writing less than 1 MB per second to the NVMe device. btrfs's autodefrag write amplification is from the 1 MB/s to approximately 200 MB/s. > Are you using commit= mount option? Which would reduce the commit > interval thus trigger autodefrag more frequently. I am not using commit= mount option. > > CPU utilization on an otherwise idle machine is approximately 600% all > > the time: btrfs-cleaner 100%, kworkers...btrfs 500%. > > The problem is why the CPU usage is at 100% for cleaner. > > Would you please apply this patch on your kernel? > https://patchwork.kernel.org/project/linux-btrfs/patch/bf2635d213e0c85251c4cd0391d8fbf274d7d637.1645705266.git.wqu@suse.com/ > > Then enable the following trace events... I will try to apply the patch, collect the events and post the results. First, I will wait for the sqlite file to gain about 1 million extents, which shouldn't take too long. ---- BTW: "compsize file-with-million-extents" finishes in 0.2 seconds (uses BTRFS_IOC_TREE_SEARCH_V2 ioctl), but "filefrag file-with-million-extents" doesn't finish even after several minutes of time (uses FS_IOC_FIEMAP ioctl - manages to perform only about 5 ioctl syscalls per second - and appears to be slowing down as the value of the "fm_start" ioctl argument grows; e2fsprogs version 1.46.5). It would be nice if filefrag was faster than just a few ioctls per second. ---- Sincerely Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
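
[Aside] The namespace's supported LBA formats can be inspected
non-destructively with nvme-cli before deciding whether a reformat is even
worth considering (a sketch assuming nvme-cli is installed; actually
switching formats with "nvme format" erases the namespace):

    $ nvme id-ns -H /dev/nvme0n1 | grep -i 'LBA Format'
    # the entry marked "(in use)" shows the current data size (512 vs 4096 bytes)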
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit
  2022-03-07  2:23 ` Jan Ziak
@ 2022-03-07  2:39   ` Qu Wenruo
  2022-03-07  7:31     ` Qu Wenruo
  2022-03-08 21:57     ` Jan Ziak
  0 siblings, 2 replies; 71+ messages in thread
From: Qu Wenruo @ 2022-03-07  2:39 UTC (permalink / raw)
  To: Jan Ziak; +Cc: linux-btrfs

On 2022/3/7 10:23, Jan Ziak wrote:
> On Mon, Mar 7, 2022 at 1:48 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> On 2022/3/6 23:59, Jan Ziak wrote:
>>> I would like to report that btrfs in Linux kernel 5.16.12 mounted with
>>> the autodefrag option wrote 5TB in a single day to a 1TB SSD that is
>>> about 50% full.
>>>
>>> Defragmenting 0.5TB on a drive that is 50% full should write far less than 5TB.
>>
>> If using defrag ioctl, that's a good and solid expectation.
>>
>> Autodefrag will mark any file which got smaller writes (<64K) for scan.
>> For smaller extents than 64K, they will be re-dirtied for writeback.
>
> The NVMe device has 512-byte sectors, but has another namespace with
> 4K sectors. Will it help btrfs-autodefrag to reformat the drive to 4K
> sectors? I expect that it won't help - I am asking just in case my
> expectation is wrong.

The minimal sector size of btrfs is 4K, so I don't believe it would
make any difference.

>
>> So in theory, if the cleaner is triggered very frequently to do
>> autodefrag, it can indeed easily amplify the writes.
>
> According to usr/bin/glances, the sqlite app is writing less than 1 MB
> per second to the NVMe device. btrfs's autodefrag write amplification
> is from the 1 MB/s to approximately 200 MB/s.

There is definitely something wrong here.

Autodefrag by default should only get triggered every 300s, so even if
all newly written bytes are re-dirtied, it should only cause a write
burst of less than 300MB every 300s, not constant writing.

>
>> Are you using commit= mount option? Which would reduce the commit
>> interval thus trigger autodefrag more frequently.
>
> I am not using commit= mount option.
>
>>> CPU utilization on an otherwise idle machine is approximately 600% all
>>> the time: btrfs-cleaner 100%, kworkers...btrfs 500%.
>>
>> The problem is why the CPU usage is at 100% for cleaner.
>>
>> Would you please apply this patch on your kernel?
>> https://patchwork.kernel.org/project/linux-btrfs/patch/bf2635d213e0c85251c4cd0391d8fbf274d7d637.1645705266.git.wqu@suse.com/
>>
>> Then enable the following trace events...
>
> I will try to apply the patch, collect the events and post the
> results. First, I will wait for the sqlite file to gain about 1
> million extents, which shouldn't take too long.

Thank you very much in advance for the trace events log.

That will be the determining data for solving this.

>
> ----
>
> BTW: "compsize file-with-million-extents" finishes in 0.2 seconds
> (uses BTRFS_IOC_TREE_SEARCH_V2 ioctl), but "filefrag
> file-with-million-extents" doesn't finish even after several minutes
> of time (uses FS_IOC_FIEMAP ioctl - manages to perform only about 5
> ioctl syscalls per second - and appears to be slowing down as the
> value of the "fm_start" ioctl argument grows; e2fsprogs version
> 1.46.5). It would be nice if filefrag was faster than just a few
> ioctls per second.

This is mostly a race with autodefrag.

Both use the file's extent map, so if autodefrag is still trying to
re-dirty the file again and again, it will definitely cause problems
for anything else that also uses the extent map.

Thanks,
Qu

>
> ----
>
> Sincerely
> Jan

^ permalink raw reply	[flat|nested] 71+ messages in thread
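
[Aside] Putting rough numbers on the mismatch described above (taking the
~1 MB/s application write rate and the default 300 s commit interval at face
value; not from the original message):

    expected: ~1 MB/s x 300 s     = at most ~300 MB re-dirtied per autodefrag pass, as periodic bursts
    observed: ~200 MB/s sustained = ~60 GB per 300 s, i.e. roughly a 200x write amplification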
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-07 2:39 ` Qu Wenruo @ 2022-03-07 7:31 ` Qu Wenruo 2022-03-10 1:10 ` Jan Ziak 2022-03-08 21:57 ` Jan Ziak 1 sibling, 1 reply; 71+ messages in thread From: Qu Wenruo @ 2022-03-07 7:31 UTC (permalink / raw) To: Jan Ziak; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 3611 bytes --] On 2022/3/7 10:39, Qu Wenruo wrote: > > > On 2022/3/7 10:23, Jan Ziak wrote: >> On Mon, Mar 7, 2022 at 1:48 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >>> On 2022/3/6 23:59, Jan Ziak wrote: >>>> I would like to report that btrfs in Linux kernel 5.16.12 mounted with >>>> the autodefrag option wrote 5TB in a single day to a 1TB SSD that is >>>> about 50% full. >>>> >>>> Defragmenting 0.5TB on a drive that is 50% full should write far >>>> less than 5TB. >>> >>> If using defrag ioctl, that's a good and solid expectation. >>> >>> Autodefrag will mark any file which got smaller writes (<64K) for scan. >>> For smaller extents than 64K, they will be re-dirtied for writeback. >> >> The NVMe device has 512-byte sectors, but has another namespace with >> 4K sectors. Will it help btrfs-autodefrag to reformat the drive to 4K >> sectors? I expect that it won't help - I am asking just in case my >> expectation is wrong. > > The minimal sector size of btrfs is 4K, so I don't believe it would > cause any difference. > >> >>> So in theory, if the cleaner is triggered very frequently to do >>> autodefrag, it can indeed easily amplify the writes. >> >> According to usr/bin/glances, the sqlite app is writing less than 1 MB >> per second to the NVMe device. btrfs's autodefrag write amplification >> is from the 1 MB/s to approximately 200 MB/s. > > This is definitely something wrong. > > Autodefrag by default should only get triggered every 300s, thus even > all new bytes are re-dirtied, it should only cause a less than 300M > write burst every 300s, not a consistent write. > >> >>> Are you using commit= mount option? Which would reduce the commit >>> interval thus trigger autodefrag more frequently. >> >> I am not using commit= mount option. >> >>>> CPU utilization on an otherwise idle machine is approximately 600% all >>>> the time: btrfs-cleaner 100%, kworkers...btrfs 500%. >>> >>> The problem is why the CPU usage is at 100% for cleaner. >>> >>> Would you please apply this patch on your kernel? >>> https://patchwork.kernel.org/project/linux-btrfs/patch/bf2635d213e0c85251c4cd0391d8fbf274d7d637.1645705266.git.wqu@suse.com/ >>> >>> >>> Then enable the following trace events... >> >> I will try to apply the patch, collect the events and post the >> results. First, I will wait for the sqlite file to gain about 1 >> million extents, which shouldn't take too long. > > Thank you very much for the future trace events log. > > That would be the determining data for us to solve it. Forgot to mention that, that patch itself relies on refactors in the previous patches. Thus you may want to apply the whole patchset. Or use the attached diff which I manually backported for v5.16.12. Thanks, Qu > >> >> ---- >> >> BTW: "compsize file-with-million-extents" finishes in 0.2 seconds >> (uses BTRFS_IOC_TREE_SEARCH_V2 ioctl), but "filefrag >> file-with-million-extents" doesn't finish even after several minutes >> of time (uses FS_IOC_FIEMAP ioctl - manages to perform only about 5 >> ioctl syscalls per second - and appears to be slowing down as the >> value of the "fm_start" ioctl argument grows; e2fsprogs version >> 1.46.5). 
It would be nice if filefrag was faster than just a few >> ioctls per second. > > This is mostly a race with autodefrag. > > Both are using file extent map, thus if autodefrag is still trying to > redirty the file again and again, it would definitely cause problems for > anything also using file extent map. > > Thanks, > Qu >> >> ---- >> >> Sincerely >> Jan [-- Attachment #2: 0001-btrfs-add-trace-events-for-defrag.patch --] [-- Type: text/x-patch, Size: 8043 bytes --] From 757bf0aa39c44fc7c3e8e57f1c785ab6c7cffa8a Mon Sep 17 00:00:00 2001 Message-Id: <757bf0aa39c44fc7c3e8e57f1c785ab6c7cffa8a.1646638257.git.wqu@suse.com> From: Qu Wenruo <wqu@suse.com> Date: Sun, 13 Feb 2022 14:19:20 +0800 Subject: [PATCH] btrfs: add trace events for defrag This is the backport for v5.16.12, without the dependency on the btrfs_defrag_ctrl refactor. This patch will introduce the following trace events: - trace_defrag_add_target() - trace_defrag_one_locked_range() - trace_defrag_file_start() - trace_defrag_file_end() Under most cases, all of them are needed to debug policy related defrag bugs. The example output would look like this: (with TASK, CPU, TIMESTAMP and UUID skipped) defrag_file_start: <UUID>: root=5 ino=257 start=0 len=131072 extent_thresh=262144 newer_than=7 flags=0x0 compress=0 max_sectors_to_defrag=1024 defrag_add_target: <UUID>: root=5 ino=257 target_start=0 target_len=4096 found em=0 len=4096 generation=7 defrag_add_target: <UUID>: root=5 ino=257 target_start=4096 target_len=4096 found em=4096 len=4096 generation=7 ... defrag_add_target: <UUID>: root=5 ino=257 target_start=57344 target_len=4096 found em=57344 len=4096 generation=7 defrag_add_target: <UUID>: root=5 ino=257 target_start=61440 target_len=4096 found em=61440 len=4096 generation=7 defrag_add_target: <UUID>: root=5 ino=257 target_start=0 target_len=4096 found em=0 len=4096 generation=7 defrag_add_target: <UUID>: root=5 ino=257 target_start=4096 target_len=4096 found em=4096 len=4096 generation=7 ... defrag_add_target: <UUID>: root=5 ino=257 target_start=57344 target_len=4096 found em=57344 len=4096 generation=7 defrag_add_target: <UUID>: root=5 ino=257 target_start=61440 target_len=4096 found em=61440 len=4096 generation=7 defrag_one_locked_range: <UUID>: root=5 ino=257 start=0 len=65536 defrag_file_end: <UUID>: root=5 ino=257 sectors_defragged=16 last_scanned=131072 ret=0 Although the defrag_add_target() part is lengthy, it shows some details of the extent map we get. With the extra info from defrag_file_start(), we can check if the target em is correct for our defrag policy. Signed-off-by: Qu Wenruo <wqu@suse.com> --- fs/btrfs/ioctl.c | 6 ++ include/trace/events/btrfs.h | 128 +++++++++++++++++++++++++++++++++++ 2 files changed, 134 insertions(+) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 541a4fbfd79e..622d10ac3e97 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1272,6 +1272,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode, add: last_is_target = true; range_len = min(extent_map_end(em), start + len) - cur; + trace_defrag_add_target(inode, em, cur, range_len); /* * This one is a good target, check if it can be merged into * last range of the target list. 
@@ -1366,6 +1367,7 @@ static int defrag_one_locked_target(struct btrfs_inode *inode, ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start, len); if (ret < 0) return ret; + trace_defrag_one_locked_range(inode, start, (u32)len); clear_extent_bit(&inode->io_tree, start, start + len - 1, EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, 0, 0, cached_state); @@ -1591,6 +1593,9 @@ int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra, /* Align the range */ cur = round_down(range->start, fs_info->sectorsize); last_byte = round_up(last_byte, fs_info->sectorsize) - 1; + trace_defrag_file_start(BTRFS_I(inode), cur, last_byte + 1 - cur, + extent_thresh, newer_than, max_to_defrag, + range->flags, range->compress_type); /* * If we were not given a ra, allocate a readahead context. As @@ -1690,6 +1695,7 @@ int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra, BTRFS_I(inode)->defrag_compress = BTRFS_COMPRESS_NONE; btrfs_inode_unlock(inode, 0); } + trace_defrag_file_end(BTRFS_I(inode), ret, sectors_defragged, cur); return ret; } diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index 8f58fd95efc7..98eb8f4a04c6 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -2263,6 +2263,134 @@ DEFINE_EVENT(btrfs__space_info_update, update_bytes_pinned, TP_ARGS(fs_info, sinfo, old, diff) ); +TRACE_EVENT(defrag_one_locked_range, + + TP_PROTO(const struct btrfs_inode *inode, u64 start, u32 len), + + TP_ARGS(inode, start, len), + + TP_STRUCT__entry_btrfs( + __field( u64, root ) + __field( u64, ino ) + __field( u64, start ) + __field( u32, len ) + ), + + TP_fast_assign_btrfs(inode->root->fs_info, + __entry->root = inode->root->root_key.objectid; + __entry->ino = btrfs_ino(inode); + __entry->start = start; + __entry->len = len; + ), + + TP_printk_btrfs("root=%llu ino=%llu start=%llu len=%u", + __entry->root, __entry->ino, __entry->start, __entry->len) +); + +TRACE_EVENT(defrag_add_target, + + TP_PROTO(const struct btrfs_inode *inode, const struct extent_map *em, + u64 start, u32 len), + + TP_ARGS(inode, em, start, len), + + TP_STRUCT__entry_btrfs( + __field( u64, root ) + __field( u64, ino ) + __field( u64, target_start ) + __field( u32, target_len ) + __field( u64, em_generation ) + __field( u64, em_start ) + __field( u64, em_len ) + ), + + TP_fast_assign_btrfs(inode->root->fs_info, + __entry->root = inode->root->root_key.objectid; + __entry->ino = btrfs_ino(inode); + __entry->target_start = start; + __entry->target_len = len; + __entry->em_generation = em->generation; + __entry->em_start = em->start; + __entry->em_len = em->len; + ), + + TP_printk_btrfs("root=%llu ino=%llu target_start=%llu target_len=%u " + "found em=%llu len=%llu generation=%llu", + __entry->root, __entry->ino, __entry->target_start, + __entry->target_len, __entry->em_start, __entry->em_len, + __entry->em_generation) +); + +TRACE_EVENT(defrag_file_start, + + TP_PROTO(const struct btrfs_inode *inode, + u64 start, u64 len, u32 extent_thresh, u64 newer_than, + unsigned long max_sectors_to_defrag, u64 flags, u32 compress), + + TP_ARGS(inode, start, len, extent_thresh, newer_than, + max_sectors_to_defrag, flags, compress), + + TP_STRUCT__entry_btrfs( + __field( u64, root ) + __field( u64, ino ) + __field( u64, start ) + __field( u64, len ) + __field( u64, newer_than ) + __field( u64, max_sectors_to_defrag ) + __field( u32, extent_thresh ) + __field( u8, flags ) + __field( u8, compress ) + ), + + TP_fast_assign_btrfs(inode->root->fs_info, + __entry->root = 
inode->root->root_key.objectid; + __entry->ino = btrfs_ino(inode); + __entry->start = start; + __entry->len = len; + __entry->extent_thresh = extent_thresh; + __entry->newer_than = newer_than; + __entry->max_sectors_to_defrag = max_sectors_to_defrag; + __entry->flags = flags; + __entry->compress = compress; + ), + + TP_printk_btrfs("root=%llu ino=%llu start=%llu len=%llu " + "extent_thresh=%u newer_than=%llu flags=0x%x compress=%u " + "max_sectors_to_defrag=%llu", + __entry->root, __entry->ino, __entry->start, __entry->len, + __entry->extent_thresh, __entry->newer_than, __entry->flags, + __entry->compress, __entry->max_sectors_to_defrag) +); + +TRACE_EVENT(defrag_file_end, + + TP_PROTO(const struct btrfs_inode *inode, + int ret, u64 sectors_defragged, u64 last_scanned), + + TP_ARGS(inode, ret, sectors_defragged, last_scanned), + + TP_STRUCT__entry_btrfs( + __field( u64, root ) + __field( u64, ino ) + __field( u64, sectors_defragged ) + __field( u64, last_scanned ) + __field( int, ret ) + ), + + TP_fast_assign_btrfs(inode->root->fs_info, + __entry->root = inode->root->root_key.objectid; + __entry->ino = btrfs_ino(inode); + __entry->sectors_defragged = sectors_defragged; + __entry->last_scanned = last_scanned; + __entry->ret = ret; + ), + + TP_printk_btrfs("root=%llu ino=%llu sectors_defragged=%llu " + "last_scanned=%llu ret=%d", + __entry->root, __entry->ino, __entry->sectors_defragged, + __entry->last_scanned, __entry->ret) +); + #endif /* _TRACE_BTRFS_H */ /* This part must be outside protection */ -- 2.35.1 ^ permalink raw reply related [flat|nested] 71+ messages in thread
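
[Aside] One way to apply the attached backport and rebuild (a sketch only;
it assumes a vanilla v5.16.12 source tree with an existing .config, and that
the attachment was saved as 0001-btrfs-add-trace-events-for-defrag.patch):

    $ cd linux-5.16.12
    $ patch -p1 < ../0001-btrfs-add-trace-events-for-defrag.patch
    $ make olddefconfig && make -j"$(nproc)"
    $ sudo make modules_install install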
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-07 7:31 ` Qu Wenruo @ 2022-03-10 1:10 ` Jan Ziak 2022-03-10 1:26 ` Qu Wenruo 0 siblings, 1 reply; 71+ messages in thread From: Jan Ziak @ 2022-03-10 1:10 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs > Or use the attached diff which I manually backported for v5.16.12. I applied the patch to 5.16.12. It takes about 35 minutes after "mount / -o remount,autodefrag" for btrfs autodefrag to start writing about 200 MB/s to the NVMe drive. $ trace-cmd record -e btrfs:defrag_* The size of the resulting trace.dat file is 4 GB. Please send me some instructions describing how to extract data relevant to the btrfs-autodefrag issue from the trace.dat file. I suppose you don't want the whole trace.dat file. Compressed trace.dat.zstd has size 324 MB. -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
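
[Aside] The defrag events can also be pulled out of the existing 4 GB
trace.dat without re-recording (a sketch using trace-cmd report; file names
are the ones mentioned above):

    $ trace-cmd report -i trace.dat \
          | grep -E 'defrag_(file_(start|end)|add_target|one_locked_range)' \
          | zstd > defrag-events.txt.zst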
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit
  2022-03-10  1:10 ` Jan Ziak
@ 2022-03-10  1:26   ` Qu Wenruo
  2022-03-10  4:33     ` Jan Ziak
  0 siblings, 1 reply; 71+ messages in thread
From: Qu Wenruo @ 2022-03-10  1:26 UTC (permalink / raw)
  To: Jan Ziak, Qu Wenruo; +Cc: linux-btrfs

On 2022/3/10 09:10, Jan Ziak wrote:
>> Or use the attached diff which I manually backported for v5.16.12.
>
> I applied the patch to 5.16.12. It takes about 35 minutes after "mount
> / -o remount,autodefrag" for btrfs autodefrag to start writing about
> 200 MB/s to the NVMe drive.
>
> $ trace-cmd record -e btrfs:defrag_*

You can skip trace-cmd and use the tracefs interface under
/sys/kernel/debug/tracing directly (trace-cmd sometimes just
over-complicates things).

This will not only reduce the size of the file, but also produce a
directly readable result (all of the following needs root privileges):

cd /sys/kernel/debug/tracing

## To disable and clear the current trace buffer and events
echo 0 > tracing_on
echo > trace
echo > set_event

## Reduce the per-cpu buffer size in KB, if you don't want a too large
## event buffer
echo 64 > buffer_size_kb

## Enable those defrag events:
echo "btrfs:defrag_one_locked_range" >> set_event
echo "btrfs:defrag_add_target" >> set_event
echo "btrfs:defrag_file_start" >> set_event
echo "btrfs:defrag_file_end" >> set_event

## Enable tracing
echo 1 > tracing_on

## After the constant writing starts, just copy the trace file
cp /sys/kernel/debug/tracing/trace /tmp/whatevername

Thanks,
Qu

>
> The size of the resulting trace.dat file is 4 GB.
>
> Please send me some instructions describing how to extract data
> relevant to the btrfs-autodefrag issue from the trace.dat file. I
> suppose you don't want the whole trace.dat file. Compressed
> trace.dat.zstd has size 324 MB.
>
> -Jan
>

^ permalink raw reply	[flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-10 1:26 ` Qu Wenruo @ 2022-03-10 4:33 ` Jan Ziak 2022-03-10 6:42 ` Qu Wenruo 0 siblings, 1 reply; 71+ messages in thread From: Jan Ziak @ 2022-03-10 4:33 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 492 bytes --] On Thu, Mar 10, 2022 at 2:26 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > ## Enable trace > > echo 1 > $tracedir/tracing_on > > ## After the consistent write happens, just copy the trace file > > cp /sys/kernel/debug/tracing/trace /tmp/whatevername The compressed trace is attached to this email. Inode 307273 is the 40GB sqlite file, it currently has 1689020 extents. This time, it took about 3 hours after "mount / -o remount,autodefrag" for the issue to start manifesting itself. -Jan [-- Attachment #2: trace.txt.zst --] [-- Type: application/zstd, Size: 86535 bytes --] ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit
  2022-03-10  4:33 ` Jan Ziak
@ 2022-03-10  6:42   ` Qu Wenruo
  2022-03-10 21:31     ` Jan Ziak
  0 siblings, 1 reply; 71+ messages in thread
From: Qu Wenruo @ 2022-03-10  6:42 UTC (permalink / raw)
  To: Jan Ziak; +Cc: linux-btrfs

On 2022/3/10 12:33, Jan Ziak wrote:
> On Thu, Mar 10, 2022 at 2:26 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> ## Enable trace
>>
>> echo 1 > $tracedir/tracing_on
>>
>> ## After the consistent write happens, just copy the trace file
>>
>> cp /sys/kernel/debug/tracing/trace /tmp/whatevername
>
> The compressed trace is attached to this email. Inode 307273 is the
> 40GB sqlite file, it currently has 1689020 extents. This time, it took
> about 3 hours after "mount / -o remount,autodefrag" for the issue to
> start manifesting itself.

Sorry, considering your sqlite file is so large, there are too many
defrag_one_locked_range() and defrag_add_target() calls.

And the buffer size is a little too small.

Would you mind re-taking the trace with the following commands?
(No need to reboot; they take effect immediately.)

cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo > trace
echo > set_event
echo 65536 > buffer_size_kb
echo "btrfs:defrag_file_start" >> set_event
echo "btrfs:defrag_file_end" >> set_event
echo 1 > tracing_on

## After the constant writing starts, just copy the trace file
cp /sys/kernel/debug/tracing/trace /tmp/whatevername

Thanks,
Qu

>
> -Jan

^ permalink raw reply	[flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-10 6:42 ` Qu Wenruo @ 2022-03-10 21:31 ` Jan Ziak 2022-03-10 23:27 ` Qu Wenruo 0 siblings, 1 reply; 71+ messages in thread From: Jan Ziak @ 2022-03-10 21:31 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On Thu, Mar 10, 2022 at 7:42 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > The compressed trace is attached to this email. Inode 307273 is the > > 40GB sqlite file, it currently has 1689020 extents. This time, it took > > about 3 hours after "mount / -o remount,autodefrag" for the issue to > > start manifesting itself. > > Sorry, considering your sqlite file is so large, there are too many > defrag_one_locked_range() and defrag_add_target() calls. > > And the buffer size is a little too small. > > Mind to re-take the trace with the following commands? The compressed trace (size: 1.8 MB) can be downloaded from http://atom-symbol.net/f/2022-03-10/btrfs-autodefrag-trace.txt.zst According to compsize: - inode 307273, at the start of the trace: 1783756 regular extents (3045856 refs), 0 inline - inode 307273, at the end of the trace: 1787794 regular extents (3054334 refs), 0 inline - inode 307273, delta: +4038 regular extents (+8478 refs) Approximately 85% of lines in the trace are related to the mentioned inode, which means that btrfs-autodefrag is trying to defragment the file. The main issue, in my opinion, is that the number of extents increased by 4038 despite btrfs's defragmentation attempts. -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
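
[Aside] The "approximately 85%" figure can be reproduced with a simple count
over the decompressed trace (a sketch; inode number as given above, file
name assumed from the URL after decompression):

    $ grep -c 'ino=307273' btrfs-autodefrag-trace.txt
    $ wc -l < btrfs-autodefrag-trace.txt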
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit
  2022-03-10 21:31 ` Jan Ziak
@ 2022-03-10 23:27   ` Qu Wenruo
  2022-03-11  2:42     ` Jan Ziak
  0 siblings, 1 reply; 71+ messages in thread
From: Qu Wenruo @ 2022-03-10 23:27 UTC (permalink / raw)
  To: Jan Ziak; +Cc: linux-btrfs

On 2022/3/11 05:31, Jan Ziak wrote:
> On Thu, Mar 10, 2022 at 7:42 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>> The compressed trace is attached to this email. Inode 307273 is the
>>> 40GB sqlite file, it currently has 1689020 extents. This time, it took
>>> about 3 hours after "mount / -o remount,autodefrag" for the issue to
>>> start manifesting itself.
>>
>> Sorry, considering your sqlite file is so large, there are too many
>> defrag_one_locked_range() and defrag_add_target() calls.
>>
>> And the buffer size is a little too small.
>>
>> Mind to re-take the trace with the following commands?
>
> The compressed trace (size: 1.8 MB) can be downloaded from
> http://atom-symbol.net/f/2022-03-10/btrfs-autodefrag-trace.txt.zst
>
> According to compsize:
>
> - inode 307273, at the start of the trace: 1783756 regular extents
>   (3045856 refs), 0 inline
>
> - inode 307273, at the end of the trace: 1787794 regular extents
>   (3054334 refs), 0 inline
>
> - inode 307273, delta: +4038 regular extents (+8478 refs)

The trace results show a pattern at the beginning: roughly every 30s,
autodefrag scans that inode once:

67292.784930: defrag_file_start: root=5 ino=307273 start=0 len=42705735680 extent_thresh=65536
67323.655798: defrag_file_start: root=5 ino=307273 start=0 len=42706268160 extent_thresh=65536
67354.126797: defrag_file_start: root=5 ino=307273 start=0 len=42706268160 extent_thresh=65536
67358.865643: defrag_file_start: root=5 ino=307273 start=0 len=42706268160 extent_thresh=65536
67385.190417: defrag_file_start: root=5 ino=307273 start=0 len=42706554880 extent_thresh=65536
67415.960153: defrag_file_start: root=5 ino=307273 start=0 len=42706554880 extent_thresh=65536
67446.798930: defrag_file_start: root=5 ino=307273 start=0 len=42707038208 extent_thresh=65536

This part is the expected behavior.

But very soon, autodefrag starts scanning the file again and again
within a very short time:

69188.802624: defrag_file_start: root=5 ino=307273 start=0 len=42720563200 extent_thresh=65536
69189.235753: defrag_file_start: root=5 ino=307273 start=0 len=42720563200 extent_thresh=65536
69189.896309: defrag_file_start: root=5 ino=307273 start=0 len=42720563200 extent_thresh=65536
69190.594834: defrag_file_start: root=5 ino=307273 start=0 len=42720563200 extent_thresh=65536
69191.185359: defrag_file_start: root=5 ino=307273 start=0 len=42720563200 extent_thresh=65536
69191.543833: defrag_file_start: root=5 ino=307273 start=0 len=42720563200 extent_thresh=65536
69192.275865: defrag_file_start: root=5 ino=307273 start=0 len=42720563200 extent_thresh=65536

That inode gets defragged 7 times in just 5 seconds, and there are more
similar patterns for the same inode.

This unexpected behavior is the same as reported by another reporter
(https://github.com/btrfs/linux/issues/423#issuecomment-1062338536).

Thus this patch should resolve the repeated-defrag behavior:
https://patchwork.kernel.org/project/linux-btrfs/patch/318a1bcdabdd1218d631ddb1a6fe1b9ca3b6b529.1646782687.git.wqu@suse.com/

Would you mind giving it a try?

>
> Approximately 85% of lines in the trace are related to the mentioned
> inode, which means that btrfs-autodefrag is trying to defragment the
> file. The main issue, in my opinion, is that the number of extents
> increased by 4038 despite btrfs's defragmentation attempts.

Well, this is a trade-off between the effectiveness of defrag and IO.

Previously we had a larger extent threshold for autodefrag (256K vs.
64K now); however, that larger threshold would cause even more IO.

In the near future (hopefully v5.19), we will introduce more
fine-tuning for autodefrag (allowing users to specify the autodefrag
interval and the target extent threshold).

Thanks,
Qu

>
> -Jan

^ permalink raw reply	[flat|nested] 71+ messages in thread
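
[Aside] The scan cadence described above can be extracted from the trace
with a one-liner (a sketch; it assumes the trimmed line format quoted above,
with the timestamp as the first field):

    $ grep 'defrag_file_start.*ino=307273' btrfs-autodefrag-trace.txt \
          | awk '{ t = $1 + 0; if (last) print t - last; last = t }'
    # prints the gap in seconds between successive scans of the inode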
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-10 23:27 ` Qu Wenruo @ 2022-03-11 2:42 ` Jan Ziak 2022-03-11 2:59 ` Qu Wenruo 0 siblings, 1 reply; 71+ messages in thread From: Jan Ziak @ 2022-03-11 2:42 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On Fri, Mar 11, 2022 at 12:27 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > The unexpected behavior is the same reported by another reporter. > (https://github.com/btrfs/linux/issues/423#issuecomment-1062338536) > > Thus this patch should resolve the repeated defrag behavior: > https://patchwork.kernel.org/project/linux-btrfs/patch/318a1bcdabdd1218d631ddb1a6fe1b9ca3b6b529.1646782687.git.wqu@suse.com/ > > Mind to give it a try? New trace (patched kernel): http://atom-symbol.net/f/2022-03-11/btrfs-autodefrag-trace-patch1.txt.zst $ cat /proc/297/io read_bytes: 217_835_884_544 write_bytes: 319_139_635_200 btrfs-cleaner (pid 297) read 217 GB and wrote 319 GB, but this had no effect on the fragmentation of the file (currently 1810562 extents). The CPU time of btrfs-cleaner is 20m22s. Machine uptime is 3h27m. -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
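
[Aside] For scale, taking the ~1 MB/s application write rate reported
earlier at face value (rough arithmetic, not from the original message):

    3h27m uptime = 12420 s; 12420 s * ~1 MB/s = at most ~12.4 GB of application writes
    319 GB written by btrfs-cleaner / ~12.4 GB = roughly 26x write amplification from the cleaner alone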
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit
  2022-03-11  2:42 ` Jan Ziak
@ 2022-03-11  2:59   ` Qu Wenruo
  2022-03-11  5:04     ` Jan Ziak
  2022-03-14 20:09     ` Phillip Susi
  0 siblings, 2 replies; 71+ messages in thread
From: Qu Wenruo @ 2022-03-11  2:59 UTC (permalink / raw)
  To: Jan Ziak; +Cc: linux-btrfs

On 2022/3/11 10:42, Jan Ziak wrote:
> On Fri, Mar 11, 2022 at 12:27 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> The unexpected behavior is the same reported by another reporter.
>> (https://github.com/btrfs/linux/issues/423#issuecomment-1062338536)
>>
>> Thus this patch should resolve the repeated defrag behavior:
>> https://patchwork.kernel.org/project/linux-btrfs/patch/318a1bcdabdd1218d631ddb1a6fe1b9ca3b6b529.1646782687.git.wqu@suse.com/
>>
>> Mind to give it a try?
>
> New trace (patched kernel):
> http://atom-symbol.net/f/2022-03-11/btrfs-autodefrag-trace-patch1.txt.zst

Mostly as expected now.

A few outliers can also be fixed by an upcoming patch:
https://patchwork.kernel.org/project/linux-btrfs/patch/d1ce90f37777987732b8ccf0edbfc961cd5c8873.1646912061.git.wqu@suse.com/

But please note that the extra patch won't have as big an impact as the
previous one; it's mostly a small optimization.

>
> $ cat /proc/297/io
> read_bytes: 217_835_884_544
> write_bytes: 319_139_635_200
>
> btrfs-cleaner (pid 297) read 217 GB and wrote 319 GB, but this had no
> effect on the fragmentation of the file (currently 1810562 extents).

That's more or less expected. Autodefrag has two limitations:

1. It only defrags newer writes.

   It doesn't defrag older fragments. This is the existing behavior
   from the beginning of autodefrag, thus it's not that effective
   against small random writes.

2. Small target extent size.

   It only targets writes smaller than 64K.

If 1. is the main reason, then even if we allow users to specify the
autodefrag extent size/interval, it won't help this workload much.

And I have already submitted a patch to the btrfs docs, explaining that
autodefrag is not really a good fit for heavy small random writes.

Thanks,
Qu

>
> The CPU time of btrfs-cleaner is 20m22s. Machine uptime is 3h27m.
>
> -Jan

^ permalink raw reply	[flat|nested] 71+ messages in thread
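
[Aside] Given that the manual defragmentation earlier in the thread brought
the file from ~1.8 million extents down to ~13 thousand, one workaround for
this kind of workload (not something suggested in the thread itself) is to
leave autodefrag off and periodically defragment just the hot file, e.g.
from a cron job or systemd timer; -t sets the target extent size:

    $ btrfs filesystem defragment -t 32M /path/to/file.sqlite

Note that on filesystems with snapshots or reflinked copies, defragmenting
can un-share extents and therefore increase space usage.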
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-11 2:59 ` Qu Wenruo @ 2022-03-11 5:04 ` Jan Ziak 2022-03-11 16:31 ` Jan Ziak 2022-03-14 20:09 ` Phillip Susi 1 sibling, 1 reply; 71+ messages in thread From: Jan Ziak @ 2022-03-11 5:04 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On Fri, Mar 11, 2022 at 3:59 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > A few outliners can also be fixed by a upcoming patch: > https://patchwork.kernel.org/project/linux-btrfs/patch/d1ce90f37777987732b8ccf0edbfc961cd5c8873.1646912061.git.wqu@suse.com/ > > But please note that, the extra patch won't bring a bigger impact as the > previous one, it's mostly a small optimization. I will apply and test the patch and report results. -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-11 5:04 ` Jan Ziak @ 2022-03-11 16:31 ` Jan Ziak 2022-03-11 20:02 ` Jan Ziak 2022-03-11 23:04 ` Qu Wenruo 0 siblings, 2 replies; 71+ messages in thread From: Jan Ziak @ 2022-03-11 16:31 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On Fri, Mar 11, 2022 at 6:04 AM Jan Ziak <0xe2.0x9a.0x9b@gmail.com> wrote: > > On Fri, Mar 11, 2022 at 3:59 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > A few outliners can also be fixed by a upcoming patch: > > https://patchwork.kernel.org/project/linux-btrfs/patch/d1ce90f37777987732b8ccf0edbfc961cd5c8873.1646912061.git.wqu@suse.com/ > > > > But please note that, the extra patch won't bring a bigger impact as the > > previous one, it's mostly a small optimization. > > I will apply and test the patch and report results. $ uptime 10h54m CPU time of pid 297: 1h48m $ cat /proc/297/io (pid 297 is btrfs-cleaner) read_bytes: 4_433_081_716_736 write_bytes: 788_509_859_840 file.sqlite, before 10h54m: 1827586 extents file.sqlite, after 10h54m: 1876144 extents Summary: File fragmentation increased by 48558 extents despite the fact that btrfs-cleaner read 4.4 TB, wrote 788 GB and consumed 1h48m of CPU time. If it helps, I can send you the complete list of all the 1.8 million extents. I am not sure how long it might take to obtain such a list. -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-11 16:31 ` Jan Ziak @ 2022-03-11 20:02 ` Jan Ziak 2022-03-11 23:04 ` Qu Wenruo 1 sibling, 0 replies; 71+ messages in thread From: Jan Ziak @ 2022-03-11 20:02 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 877 bytes --] On Fri, Mar 11, 2022 at 5:31 PM Jan Ziak <0xe2.0x9a.0x9b@gmail.com> wrote: > If it helps, I can send you the complete list of all the 1.8 million > extents. I am not sure how long it might take to obtain such a list. It takes only a couple of minutes to get all extents by using xfs_io. A text file containing the histogram of all the file's extents is attached to this email. I suggest we do the following steps: 1. Make a snapshot of the file's extents (extents1.txt) 2. Enable btrfs autodefrag, enable tracing 3.1. Make a snapshot of the file's extents (extents2.txt) 3.2. Save /sys/kernel/debug/tracing/trace (trace.txt) 4. Compute the difference between extents1.txt and extents2.txt 5. Compare extents-diff.txt and trace.txt, in order to determine why the number of extents is increasing over time despite btrfs-autodefrag's attempts to defragment the file -Jan [-- Attachment #2: extents-histogram.txt --] [-- Type: text/plain, Size: 22650 bytes --] xfs_io -c "fiemap -v 0g 100g": 3194386 extents 1 block = 512 bytes 8 blocks: 2216051 instances 16 blocks: 302377 instances 24 blocks: 189380 instances 32 blocks: 117976 instances 40 blocks: 80658 instances 48 blocks: 58107 instances 56 blocks: 42171 instances 64 blocks: 31595 instances 72 blocks: 24612 instances 80 blocks: 19104 instances 88 blocks: 15051 instances 96 blocks: 12393 instances 104 blocks: 10114 instances 112 blocks: 8555 instances 120 blocks: 7147 instances 128 blocks: 5910 instances 136 blocks: 4991 instances 144 blocks: 4187 instances 152 blocks: 3488 instances 160 blocks: 2928 instances 168 blocks: 2499 instances 176 blocks: 2184 instances 184 blocks: 1816 instances 192 blocks: 1611 instances 200 blocks: 1416 instances 208 blocks: 1242 instances 216 blocks: 1079 instances 224 blocks: 938 instances 232 blocks: 809 instances 240 blocks: 766 instances 248 blocks: 669 instances 256 blocks: 614 instances 264 blocks: 571 instances 272 blocks: 528 instances 280 blocks: 494 instances 288 blocks: 423 instances 296 blocks: 383 instances 304 blocks: 359 instances 312 blocks: 365 instances 320 blocks: 310 instances 328 blocks: 306 instances 336 blocks: 241 instances 344 blocks: 271 instances 352 blocks: 260 instances 360 blocks: 289 instances 368 blocks: 278 instances 376 blocks: 244 instances 384 blocks: 232 instances 392 blocks: 262 instances 400 blocks: 202 instances 408 blocks: 212 instances 416 blocks: 218 instances 424 blocks: 215 instances 432 blocks: 192 instances 440 blocks: 209 instances 448 blocks: 198 instances 456 blocks: 195 instances 464 blocks: 176 instances 472 blocks: 206 instances 480 blocks: 172 instances 488 blocks: 178 instances 496 blocks: 177 instances 504 blocks: 158 instances 512 blocks: 181 instances 520 blocks: 174 instances 528 blocks: 152 instances 536 blocks: 146 instances 544 blocks: 146 instances 552 blocks: 151 instances 560 blocks: 145 instances 568 blocks: 158 instances 576 blocks: 136 instances 584 blocks: 133 instances 592 blocks: 140 instances 600 blocks: 131 instances 608 blocks: 123 instances 616 blocks: 122 instances 624 blocks: 113 instances 632 blocks: 133 instances 640 blocks: 118 instances 648 blocks: 139 instances 656 blocks: 137 instances 
664 blocks: 128 instances 672 blocks: 119 instances 680 blocks: 101 instances 688 blocks: 98 instances 696 blocks: 116 instances 704 blocks: 101 instances 712 blocks: 119 instances 720 blocks: 106 instances 728 blocks: 107 instances 736 blocks: 95 instances 744 blocks: 118 instances 752 blocks: 97 instances 760 blocks: 110 instances 768 blocks: 109 instances 776 blocks: 89 instances 784 blocks: 96 instances 792 blocks: 79 instances 800 blocks: 90 instances 808 blocks: 81 instances 816 blocks: 88 instances 824 blocks: 87 instances 832 blocks: 86 instances 840 blocks: 81 instances 848 blocks: 93 instances 856 blocks: 90 instances 864 blocks: 85 instances 872 blocks: 75 instances 880 blocks: 79 instances 888 blocks: 92 instances 896 blocks: 92 instances 904 blocks: 73 instances 912 blocks: 87 instances 920 blocks: 89 instances 928 blocks: 93 instances 936 blocks: 90 instances 944 blocks: 85 instances 952 blocks: 93 instances 960 blocks: 62 instances 968 blocks: 66 instances 976 blocks: 65 instances 984 blocks: 86 instances 992 blocks: 76 instances 1000 blocks: 77 instances 1008 blocks: 62 instances 1016 blocks: 66 instances 1024 blocks: 74 instances 1032 blocks: 85 instances 1040 blocks: 70 instances 1048 blocks: 56 instances 1056 blocks: 84 instances 1064 blocks: 64 instances 1072 blocks: 64 instances 1080 blocks: 72 instances 1088 blocks: 53 instances 1096 blocks: 59 instances 1104 blocks: 50 instances 1112 blocks: 46 instances 1120 blocks: 49 instances 1128 blocks: 68 instances 1136 blocks: 59 instances 1144 blocks: 57 instances 1152 blocks: 53 instances 1160 blocks: 52 instances 1168 blocks: 46 instances 1176 blocks: 55 instances 1184 blocks: 50 instances 1192 blocks: 60 instances 1200 blocks: 47 instances 1208 blocks: 38 instances 1216 blocks: 61 instances 1224 blocks: 56 instances 1232 blocks: 66 instances 1240 blocks: 57 instances 1248 blocks: 45 instances 1256 blocks: 56 instances 1264 blocks: 47 instances 1272 blocks: 46 instances 1280 blocks: 50 instances 1288 blocks: 45 instances 1296 blocks: 44 instances 1304 blocks: 53 instances 1312 blocks: 45 instances 1320 blocks: 61 instances 1328 blocks: 42 instances 1336 blocks: 46 instances 1344 blocks: 52 instances 1352 blocks: 50 instances 1360 blocks: 46 instances 1368 blocks: 39 instances 1376 blocks: 53 instances 1384 blocks: 41 instances 1392 blocks: 55 instances 1400 blocks: 33 instances 1408 blocks: 46 instances 1416 blocks: 29 instances 1424 blocks: 34 instances 1432 blocks: 41 instances 1440 blocks: 30 instances 1448 blocks: 38 instances 1456 blocks: 29 instances 1464 blocks: 34 instances 1472 blocks: 34 instances 1480 blocks: 31 instances 1488 blocks: 38 instances 1496 blocks: 34 instances 1504 blocks: 41 instances 1512 blocks: 34 instances 1520 blocks: 25 instances 1528 blocks: 24 instances 1536 blocks: 32 instances 1544 blocks: 32 instances 1552 blocks: 24 instances 1560 blocks: 27 instances 1568 blocks: 29 instances 1576 blocks: 29 instances 1584 blocks: 24 instances 1592 blocks: 26 instances 1600 blocks: 33 instances 1608 blocks: 29 instances 1616 blocks: 32 instances 1624 blocks: 33 instances 1632 blocks: 32 instances 1640 blocks: 24 instances 1648 blocks: 25 instances 1656 blocks: 20 instances 1664 blocks: 23 instances 1672 blocks: 28 instances 1680 blocks: 20 instances 1688 blocks: 30 instances 1696 blocks: 27 instances 1704 blocks: 16 instances 1712 blocks: 31 instances 1720 blocks: 20 instances 1728 blocks: 19 instances 1736 blocks: 20 instances 1744 blocks: 22 instances 1752 blocks: 19 instances 1760 blocks: 24 
instances 1768 blocks: 23 instances 1776 blocks: 23 instances 1784 blocks: 25 instances 1792 blocks: 21 instances 1800 blocks: 17 instances 1808 blocks: 21 instances 1816 blocks: 24 instances 1824 blocks: 14 instances 1832 blocks: 11 instances 1840 blocks: 21 instances 1848 blocks: 19 instances 1856 blocks: 18 instances 1864 blocks: 9 instances 1872 blocks: 19 instances 1880 blocks: 16 instances 1888 blocks: 19 instances 1896 blocks: 15 instances 1904 blocks: 22 instances 1912 blocks: 18 instances 1920 blocks: 14 instances 1928 blocks: 10 instances 1936 blocks: 15 instances 1944 blocks: 17 instances 1952 blocks: 19 instances 1960 blocks: 16 instances 1968 blocks: 20 instances 1976 blocks: 23 instances 1984 blocks: 19 instances 1992 blocks: 19 instances 2000 blocks: 17 instances 2008 blocks: 15 instances 2016 blocks: 14 instances 2024 blocks: 15 instances 2032 blocks: 25 instances 2040 blocks: 17 instances 2048 blocks: 15 instances 2056 blocks: 10 instances 2064 blocks: 13 instances 2072 blocks: 29 instances 2080 blocks: 16 instances 2088 blocks: 14 instances 2096 blocks: 17 instances 2104 blocks: 12 instances 2112 blocks: 13 instances 2120 blocks: 17 instances 2128 blocks: 19 instances 2136 blocks: 14 instances 2144 blocks: 12 instances 2152 blocks: 9 instances 2160 blocks: 7 instances 2168 blocks: 8 instances 2176 blocks: 17 instances 2184 blocks: 10 instances 2192 blocks: 13 instances 2200 blocks: 14 instances 2208 blocks: 17 instances 2216 blocks: 14 instances 2224 blocks: 14 instances 2232 blocks: 11 instances 2240 blocks: 14 instances 2248 blocks: 10 instances 2256 blocks: 11 instances 2264 blocks: 14 instances 2272 blocks: 20 instances 2280 blocks: 10 instances 2288 blocks: 7 instances 2296 blocks: 13 instances 2304 blocks: 9 instances 2312 blocks: 13 instances 2320 blocks: 9 instances 2328 blocks: 6 instances 2336 blocks: 10 instances 2344 blocks: 11 instances 2352 blocks: 11 instances 2360 blocks: 11 instances 2368 blocks: 6 instances 2376 blocks: 17 instances 2384 blocks: 10 instances 2392 blocks: 7 instances 2400 blocks: 10 instances 2408 blocks: 5 instances 2416 blocks: 9 instances 2424 blocks: 9 instances 2432 blocks: 11 instances 2440 blocks: 14 instances 2448 blocks: 12 instances 2456 blocks: 16 instances 2464 blocks: 10 instances 2472 blocks: 9 instances 2480 blocks: 7 instances 2488 blocks: 6 instances 2496 blocks: 11 instances 2504 blocks: 13 instances 2512 blocks: 9 instances 2520 blocks: 8 instances 2528 blocks: 6 instances 2536 blocks: 14 instances 2544 blocks: 7 instances 2552 blocks: 9 instances 2560 blocks: 10 instances 2568 blocks: 11 instances 2576 blocks: 7 instances 2584 blocks: 10 instances 2592 blocks: 14 instances 2600 blocks: 15 instances 2608 blocks: 11 instances 2616 blocks: 8 instances 2624 blocks: 6 instances 2632 blocks: 12 instances 2640 blocks: 12 instances 2648 blocks: 5 instances 2656 blocks: 10 instances 2664 blocks: 7 instances 2672 blocks: 9 instances 2680 blocks: 7 instances 2688 blocks: 7 instances 2696 blocks: 9 instances 2704 blocks: 8 instances 2712 blocks: 8 instances 2720 blocks: 9 instances 2728 blocks: 11 instances 2736 blocks: 7 instances 2744 blocks: 5 instances 2752 blocks: 11 instances 2760 blocks: 10 instances 2768 blocks: 4 instances 2776 blocks: 9 instances 2784 blocks: 6 instances 2792 blocks: 6 instances 2800 blocks: 9 instances 2808 blocks: 10 instances 2816 blocks: 9 instances 2824 blocks: 7 instances 2832 blocks: 6 instances 2840 blocks: 12 instances 2848 blocks: 4 instances 2856 blocks: 7 instances 2864 blocks: 7 instances 
2872 blocks: 7 instances 2880 blocks: 3 instances 2888 blocks: 6 instances 2896 blocks: 9 instances 2904 blocks: 8 instances 2912 blocks: 3 instances 2920 blocks: 5 instances 2928 blocks: 9 instances 2936 blocks: 6 instances 2944 blocks: 4 instances 2952 blocks: 7 instances 2960 blocks: 3 instances 2968 blocks: 3 instances 2976 blocks: 2 instances 2984 blocks: 3 instances 2992 blocks: 10 instances 3000 blocks: 4 instances 3008 blocks: 4 instances 3016 blocks: 3 instances 3024 blocks: 3 instances 3032 blocks: 1 instances 3040 blocks: 6 instances 3048 blocks: 4 instances 3056 blocks: 10 instances 3064 blocks: 3 instances 3072 blocks: 2 instances 3080 blocks: 8 instances 3088 blocks: 5 instances 3096 blocks: 7 instances 3104 blocks: 2 instances 3112 blocks: 6 instances 3120 blocks: 3 instances 3128 blocks: 6 instances 3136 blocks: 5 instances 3144 blocks: 3 instances 3152 blocks: 2 instances 3160 blocks: 3 instances 3168 blocks: 4 instances 3176 blocks: 6 instances 3184 blocks: 5 instances 3192 blocks: 8 instances 3200 blocks: 3 instances 3208 blocks: 6 instances 3216 blocks: 5 instances 3224 blocks: 3 instances 3232 blocks: 7 instances 3240 blocks: 2 instances 3248 blocks: 5 instances 3256 blocks: 4 instances 3264 blocks: 4 instances 3272 blocks: 3 instances 3288 blocks: 3 instances 3296 blocks: 5 instances 3304 blocks: 6 instances 3312 blocks: 3 instances 3320 blocks: 8 instances 3328 blocks: 1 instances 3336 blocks: 4 instances 3344 blocks: 4 instances 3352 blocks: 6 instances 3360 blocks: 1 instances 3368 blocks: 3 instances 3376 blocks: 3 instances 3384 blocks: 5 instances 3392 blocks: 2 instances 3400 blocks: 1 instances 3408 blocks: 5 instances 3416 blocks: 6 instances 3424 blocks: 9 instances 3432 blocks: 1 instances 3440 blocks: 1 instances 3448 blocks: 3 instances 3456 blocks: 2 instances 3464 blocks: 4 instances 3472 blocks: 2 instances 3480 blocks: 5 instances 3488 blocks: 2 instances 3496 blocks: 4 instances 3504 blocks: 10 instances 3512 blocks: 2 instances 3520 blocks: 3 instances 3528 blocks: 3 instances 3536 blocks: 1 instances 3544 blocks: 2 instances 3552 blocks: 3 instances 3560 blocks: 2 instances 3568 blocks: 4 instances 3576 blocks: 3 instances 3584 blocks: 3 instances 3592 blocks: 5 instances 3600 blocks: 5 instances 3608 blocks: 3 instances 3616 blocks: 3 instances 3624 blocks: 4 instances 3632 blocks: 4 instances 3640 blocks: 5 instances 3648 blocks: 2 instances 3656 blocks: 4 instances 3664 blocks: 4 instances 3672 blocks: 2 instances 3680 blocks: 4 instances 3688 blocks: 2 instances 3696 blocks: 2 instances 3704 blocks: 1 instances 3712 blocks: 2 instances 3720 blocks: 1 instances 3728 blocks: 3 instances 3736 blocks: 3 instances 3744 blocks: 2 instances 3760 blocks: 6 instances 3768 blocks: 1 instances 3776 blocks: 2 instances 3784 blocks: 1 instances 3792 blocks: 2 instances 3800 blocks: 2 instances 3808 blocks: 3 instances 3816 blocks: 2 instances 3824 blocks: 1 instances 3832 blocks: 3 instances 3840 blocks: 4 instances 3848 blocks: 3 instances 3856 blocks: 2 instances 3864 blocks: 5 instances 3872 blocks: 5 instances 3880 blocks: 3 instances 3888 blocks: 1 instances 3896 blocks: 1 instances 3912 blocks: 2 instances 3920 blocks: 2 instances 3944 blocks: 2 instances 3952 blocks: 3 instances 3968 blocks: 3 instances 3976 blocks: 3 instances 3984 blocks: 2 instances 3992 blocks: 2 instances 4008 blocks: 3 instances 4016 blocks: 1 instances 4024 blocks: 1 instances 4032 blocks: 1 instances 4040 blocks: 1 instances 4048 blocks: 2 instances 4064 blocks: 4 instances 
4072 blocks: 4 instances 4088 blocks: 4 instances 4096 blocks: 2 instances 4104 blocks: 4 instances 4112 blocks: 4 instances 4120 blocks: 1 instances 4128 blocks: 1 instances 4136 blocks: 2 instances 4144 blocks: 3 instances 4152 blocks: 1 instances 4160 blocks: 3 instances 4168 blocks: 5 instances 4176 blocks: 1 instances 4184 blocks: 1 instances 4192 blocks: 2 instances 4200 blocks: 1 instances 4208 blocks: 5 instances 4224 blocks: 2 instances 4232 blocks: 2 instances 4240 blocks: 1 instances 4248 blocks: 2 instances 4256 blocks: 6 instances 4272 blocks: 4 instances 4280 blocks: 2 instances 4304 blocks: 3 instances 4312 blocks: 3 instances 4320 blocks: 1 instances 4328 blocks: 2 instances 4336 blocks: 2 instances 4344 blocks: 2 instances 4352 blocks: 1 instances 4360 blocks: 1 instances 4368 blocks: 4 instances 4376 blocks: 1 instances 4392 blocks: 2 instances 4400 blocks: 3 instances 4408 blocks: 2 instances 4416 blocks: 2 instances 4424 blocks: 2 instances 4432 blocks: 1 instances 4440 blocks: 2 instances 4448 blocks: 1 instances 4456 blocks: 1 instances 4464 blocks: 1 instances 4480 blocks: 2 instances 4488 blocks: 2 instances 4496 blocks: 2 instances 4504 blocks: 3 instances 4512 blocks: 3 instances 4520 blocks: 1 instances 4536 blocks: 1 instances 4544 blocks: 4 instances 4552 blocks: 1 instances 4560 blocks: 2 instances 4568 blocks: 2 instances 4584 blocks: 2 instances 4592 blocks: 2 instances 4608 blocks: 2 instances 4624 blocks: 2 instances 4632 blocks: 1 instances 4640 blocks: 2 instances 4656 blocks: 1 instances 4664 blocks: 1 instances 4672 blocks: 1 instances 4680 blocks: 1 instances 4688 blocks: 2 instances 4696 blocks: 1 instances 4712 blocks: 4 instances 4728 blocks: 1 instances 4736 blocks: 1 instances 4744 blocks: 1 instances 4760 blocks: 1 instances 4768 blocks: 1 instances 4776 blocks: 1 instances 4784 blocks: 2 instances 4792 blocks: 4 instances 4800 blocks: 1 instances 4808 blocks: 2 instances 4824 blocks: 3 instances 4832 blocks: 1 instances 4840 blocks: 2 instances 4848 blocks: 2 instances 4856 blocks: 2 instances 4864 blocks: 2 instances 4880 blocks: 1 instances 4896 blocks: 1 instances 4904 blocks: 1 instances 4912 blocks: 2 instances 4944 blocks: 5 instances 4952 blocks: 1 instances 4960 blocks: 2 instances 4968 blocks: 1 instances 4984 blocks: 1 instances 5000 blocks: 1 instances 5008 blocks: 2 instances 5016 blocks: 1 instances 5032 blocks: 1 instances 5040 blocks: 1 instances 5088 blocks: 2 instances 5112 blocks: 1 instances 5120 blocks: 2 instances 5128 blocks: 1 instances 5136 blocks: 3 instances 5152 blocks: 3 instances 5160 blocks: 2 instances 5176 blocks: 1 instances 5184 blocks: 2 instances 5192 blocks: 1 instances 5200 blocks: 2 instances 5208 blocks: 1 instances 5216 blocks: 1 instances 5224 blocks: 2 instances 5232 blocks: 2 instances 5256 blocks: 1 instances 5280 blocks: 1 instances 5288 blocks: 2 instances 5296 blocks: 4 instances 5304 blocks: 1 instances 5320 blocks: 2 instances 5328 blocks: 1 instances 5336 blocks: 1 instances 5344 blocks: 1 instances 5352 blocks: 1 instances 5360 blocks: 1 instances 5376 blocks: 1 instances 5384 blocks: 2 instances 5400 blocks: 1 instances 5408 blocks: 1 instances 5424 blocks: 1 instances 5432 blocks: 4 instances 5456 blocks: 1 instances 5464 blocks: 3 instances 5480 blocks: 1 instances 5504 blocks: 1 instances 5512 blocks: 2 instances 5520 blocks: 1 instances 5536 blocks: 2 instances 5544 blocks: 1 instances 5552 blocks: 1 instances 5576 blocks: 1 instances 5592 blocks: 1 instances 5632 blocks: 1 instances 5640 
blocks: 1 instances 5656 blocks: 1 instances 5672 blocks: 1 instances 5680 blocks: 1 instances 5688 blocks: 1 instances 5696 blocks: 1 instances 5736 blocks: 1 instances 5744 blocks: 2 instances 5760 blocks: 1 instances 5768 blocks: 1 instances 5776 blocks: 2 instances 5784 blocks: 2 instances 5792 blocks: 2 instances 5800 blocks: 1 instances 5816 blocks: 1 instances 5888 blocks: 1 instances 5896 blocks: 1 instances 5912 blocks: 2 instances 5920 blocks: 1 instances 5944 blocks: 1 instances 5952 blocks: 1 instances 5992 blocks: 1 instances 6016 blocks: 1 instances 6072 blocks: 2 instances 6088 blocks: 2 instances 6104 blocks: 1 instances 6144 blocks: 2 instances 6152 blocks: 1 instances 6160 blocks: 1 instances 6176 blocks: 1 instances 6200 blocks: 1 instances 6216 blocks: 1 instances 6224 blocks: 1 instances 6232 blocks: 1 instances 6240 blocks: 2 instances 6256 blocks: 2 instances 6264 blocks: 2 instances 6272 blocks: 1 instances 6280 blocks: 1 instances 6296 blocks: 1 instances 6320 blocks: 1 instances 6344 blocks: 1 instances 6392 blocks: 1 instances 6400 blocks: 1 instances 6408 blocks: 3 instances 6416 blocks: 2 instances 6440 blocks: 1 instances 6448 blocks: 2 instances 6480 blocks: 1 instances 6504 blocks: 2 instances 6520 blocks: 1 instances 6584 blocks: 1 instances 6616 blocks: 1 instances 6624 blocks: 1 instances 6640 blocks: 1 instances 6656 blocks: 1 instances 6672 blocks: 1 instances 6704 blocks: 1 instances 6712 blocks: 1 instances 6736 blocks: 2 instances 6768 blocks: 1 instances 6824 blocks: 1 instances 6864 blocks: 1 instances 6904 blocks: 1 instances 6912 blocks: 1 instances 6944 blocks: 2 instances 6952 blocks: 1 instances 6968 blocks: 1 instances 7032 blocks: 1 instances 7072 blocks: 1 instances 7080 blocks: 2 instances 7104 blocks: 1 instances 7144 blocks: 2 instances 7160 blocks: 1 instances 7168 blocks: 1 instances 7176 blocks: 1 instances 7184 blocks: 1 instances 7200 blocks: 1 instances 7216 blocks: 1 instances 7248 blocks: 1 instances 7272 blocks: 1 instances 7352 blocks: 1 instances 7376 blocks: 2 instances 7408 blocks: 1 instances 7416 blocks: 1 instances 7552 blocks: 2 instances 7576 blocks: 1 instances 7592 blocks: 1 instances 7608 blocks: 1 instances 7648 blocks: 1 instances 7680 blocks: 1 instances 7712 blocks: 1 instances 7784 blocks: 2 instances 7856 blocks: 1 instances 7864 blocks: 1 instances 7896 blocks: 1 instances 7928 blocks: 1 instances 8008 blocks: 1 instances 8080 blocks: 1 instances 8176 blocks: 2 instances 8216 blocks: 1 instances 8248 blocks: 1 instances 8336 blocks: 1 instances 8344 blocks: 1 instances 8432 blocks: 1 instances 8520 blocks: 1 instances 8528 blocks: 1 instances 8632 blocks: 1 instances 8712 blocks: 1 instances 8736 blocks: 1 instances 8776 blocks: 1 instances 8816 blocks: 3 instances 8872 blocks: 2 instances 8944 blocks: 1 instances 8992 blocks: 1 instances 9104 blocks: 1 instances 9208 blocks: 1 instances 9448 blocks: 1 instances 9632 blocks: 1 instances 9984 blocks: 1 instances 10152 blocks: 1 instances 10648 blocks: 1 instances 10744 blocks: 1 instances 10896 blocks: 1 instances 10968 blocks: 1 instances 11552 blocks: 1 instances 11800 blocks: 1 instances 11888 blocks: 1 instances 11936 blocks: 1 instances 12184 blocks: 1 instances 12424 blocks: 1 instances 12624 blocks: 1 instances 12728 blocks: 1 instances 12896 blocks: 1 instances 12920 blocks: 1 instances 13120 blocks: 1 instances 13192 blocks: 1 instances 13408 blocks: 1 instances 13672 blocks: 1 instances 13840 blocks: 1 instances 13968 blocks: 1 instances 14480 blocks: 
1 instances 14544 blocks: 1 instances 15168 blocks: 1 instances 15448 blocks: 1 instances 15592 blocks: 1 instances 15704 blocks: 1 instances 15824 blocks: 1 instances 16016 blocks: 1 instances 16240 blocks: 1 instances 16320 blocks: 1 instances 16928 blocks: 1 instances 16976 blocks: 1 instances 17440 blocks: 1 instances 20168 blocks: 1 instances 20552 blocks: 1 instances 20872 blocks: 1 instances 21992 blocks: 1 instances 22016 blocks: 1 instances 23256 blocks: 1 instances 23784 blocks: 1 instances 24248 blocks: 1 instances 24736 blocks: 1 instances 24744 blocks: 1 instances 25096 blocks: 1 instances 25504 blocks: 1 instances 25944 blocks: 1 instances 26112 blocks: 1 instances 26368 blocks: 1 instances 26504 blocks: 1 instances 26768 blocks: 1 instances 28280 blocks: 1 instances 29184 blocks: 1 instances 29592 blocks: 1 instances 29632 blocks: 1 instances 31048 blocks: 1 instances 32440 blocks: 1 instances 34608 blocks: 1 instances 35752 blocks: 1 instances 36080 blocks: 1 instances 36464 blocks: 1 instances 38912 blocks: 1 instances 39752 blocks: 1 instances 40448 blocks: 1 instances 42184 blocks: 1 instances 42616 blocks: 1 instances 43576 blocks: 1 instances 44464 blocks: 1 instances 45656 blocks: 1 instances 48432 blocks: 1 instances 48752 blocks: 1 instances 53144 blocks: 1 instances 54120 blocks: 1 instances 55296 blocks: 1 instances 56584 blocks: 1 instances 59216 blocks: 1 instances 60928 blocks: 1 instances 64000 blocks: 1 instances 64624 blocks: 1 instances 65008 blocks: 1 instances 65024 blocks: 13 instances 65088 blocks: 1 instances 65136 blocks: 1 instances 65168 blocks: 1 instances 65184 blocks: 1 instances 65224 blocks: 1 instances 65232 blocks: 1 instances 65272 blocks: 1 instances 65360 blocks: 1 instances 65368 blocks: 1 instances 65384 blocks: 1 instances 65408 blocks: 1 instances 65416 blocks: 1 instances 65440 blocks: 1 instances 65472 blocks: 1 instances 65488 blocks: 1 instances 65496 blocks: 1 instances 65520 blocks: 1 instances 67072 blocks: 2 instances 68608 blocks: 1 instances 69120 blocks: 1 instances 69456 blocks: 1 instances 69720 blocks: 1 instances 83264 blocks: 1 instances 85272 blocks: 1 instances 94696 blocks: 1 instances 99336 blocks: 1 instances 100208 blocks: 1 instances 100448 blocks: 1 instances 104768 blocks: 1 instances 106992 blocks: 1 instances 112664 blocks: 1 instances 118424 blocks: 1 instances 130048 blocks: 2 instances 130208 blocks: 1 instances 134656 blocks: 1 instances 137568 blocks: 1 instances 156384 blocks: 1 instances 162640 blocks: 1 instances 171728 blocks: 1 instances 189216 blocks: 1 instances 195072 blocks: 1 instances 231424 blocks: 1 instances ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-11 16:31 ` Jan Ziak 2022-03-11 20:02 ` Jan Ziak @ 2022-03-11 23:04 ` Qu Wenruo 2022-03-11 23:28 ` Jan Ziak 1 sibling, 1 reply; 71+ messages in thread From: Qu Wenruo @ 2022-03-11 23:04 UTC (permalink / raw) To: Jan Ziak; +Cc: linux-btrfs On 2022/3/12 00:31, Jan Ziak wrote: > On Fri, Mar 11, 2022 at 6:04 AM Jan Ziak <0xe2.0x9a.0x9b@gmail.com> wrote: >> >> On Fri, Mar 11, 2022 at 3:59 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >>> A few outliers can also be fixed by an upcoming patch: >>> https://patchwork.kernel.org/project/linux-btrfs/patch/d1ce90f37777987732b8ccf0edbfc961cd5c8873.1646912061.git.wqu@suse.com/ >>> >>> But please note that the extra patch won't have as big an impact as the >>> previous one; it's mostly a small optimization. >> >> I will apply and test the patch and report results. > > $ uptime > 10h54m > > CPU time of pid 297: 1h48m > > $ cat /proc/297/io (pid 297 is btrfs-cleaner) > read_bytes: 4_433_081_716_736 > write_bytes: 788_509_859_840 > > file.sqlite, before 10h54m: 1827586 extents > > file.sqlite, after 10h54m: 1876144 extents > > Summary: File fragmentation increased by 48558 extents despite the > fact that btrfs-cleaner read 4.4 TB, wrote 788 GB and consumed 1h48m > of CPU time. > > If it helps, I can send you the complete list of all the 1.8 million > extents. I am not sure how long it might take to obtain such a list. As stated before, autodefrag is not really that useful for databases. So my primary objective here is to make autodefrag cause less CPU/IO in the worst case scenario. BTW, have you compared the number of extents with and without autodefrag? Thanks, Qu > > -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-11 23:04 ` Qu Wenruo @ 2022-03-11 23:28 ` Jan Ziak 2022-03-11 23:39 ` Qu Wenruo 2022-03-12 2:43 ` Zygo Blaxell 0 siblings, 2 replies; 71+ messages in thread From: Jan Ziak @ 2022-03-11 23:28 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On Sat, Mar 12, 2022 at 12:04 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > As stated before, autodefrag is not really that useful for database. Do you realize that you are claiming that btrfs autodefrag should not - by design - be effective in the case of high-fragmentation files? If it isn't supposed to be useful for high-fragmentation files then where is it supposed to be useful? Low-fragmentation files? -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-11 23:28 ` Jan Ziak @ 2022-03-11 23:39 ` Qu Wenruo 2022-03-12 0:01 ` Jan Ziak 2022-03-12 2:43 ` Zygo Blaxell 1 sibling, 1 reply; 71+ messages in thread From: Qu Wenruo @ 2022-03-11 23:39 UTC (permalink / raw) To: Jan Ziak; +Cc: linux-btrfs On 2022/3/12 07:28, Jan Ziak wrote: > On Sat, Mar 12, 2022 at 12:04 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >> As stated before, autodefrag is not really that useful for database. > > Do you realize that you are claiming that btrfs autodefrag should not > - by design - be effective in the case of high-fragmentation files? Unfortunately, that's exactly what I mean. We all know random writes cause fragmentation, but autodefrag is not like the regular defrag ioctl, as it only scans newer extents. For example: Say autodefrag is asked to defrag writes newer than gen 100, and our inode has the following layout: |---Ext A---|--- Ext B---|---Ext C---|---Ext D---|---Ext E---| Gen 50 Gen 101 Gen 49 Gen 30 Gen 30 Then autodefrag will only try to defrag extent B and extent C. Extent B meets the generation requirement, and is mergeable with the next extent C. But all the remaining extents A, D, E will not be defragged, as their generations don't meet the requirement. The regular defrag ioctl, by contrast, has no such generation requirement and is able to defrag all extents from A to E (but causes far more IO). Furthermore, autodefrag works by marking the target range dirty and waiting for writeback (hopefully picking up more writes near it, so the resulting extent can grow even larger). But if the application, like a database, is calling fsync() frequently, such a re-dirtied range is written back almost immediately, without any further chance to be merged into something larger. Thus autodefrag's effectiveness is almost zero for random writes plus frequent fsync(), which is exactly the database workload. > If > it isn't supposed to be useful for high-fragmentation files then where > is it supposed to be useful? Low-fragmentation files? Frequent append writes, or less frequent fsync() calls. Thanks, Qu > > -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
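To make the selection rule Qu describes concrete, here is a small illustrative sketch in C. It is a toy model only: the struct, the function names, the 64K cutoff, and the extent sizes are assumptions made for this example, not the kernel's actual code.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct extent {
    uint64_t gen;   /* transaction generation that wrote this extent */
    uint64_t len;   /* extent length in bytes */
};

/* "New and small": written after the requested generation and below the
 * autodefrag size cutoff. */
static bool is_new_and_small(const struct extent *e, uint64_t newer_than,
                             uint64_t thresh)
{
    return e->gen >= newer_than && e->len < thresh;
}

/* Extent i is re-dirtied if it is itself a new small write (B in the
 * example), or if the new small write immediately before it wants to merge
 * forward into it (C).  Older extents with no new neighbor in front of them
 * (A, D, E) are left alone, no matter how fragmented the file already is. */
static bool autodefrag_target(const struct extent *ext, size_t i,
                              uint64_t newer_than, uint64_t thresh)
{
    if (is_new_and_small(&ext[i], newer_than, thresh))
        return true;
    return i > 0 && is_new_and_small(&ext[i - 1], newer_than, thresh);
}

int main(void)
{
    /* The five-extent layout from the example; only B is newer than gen 100. */
    struct extent ext[] = {
        { 50, 16384 }, { 101, 4096 }, { 49, 16384 }, { 30, 16384 }, { 30, 16384 },
    };
    const char *names = "ABCDE";

    for (size_t i = 0; i < 5; i++)
        printf("extent %c: %s\n", names[i],
               autodefrag_target(ext, i, 100, 65536) ? "defrag" : "skip");
    return 0;
}

Running it prints "defrag" only for B and C, matching the description above.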
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-11 23:39 ` Qu Wenruo @ 2022-03-12 0:01 ` Jan Ziak 2022-03-12 0:15 ` Qu Wenruo 2022-03-12 3:16 ` Zygo Blaxell 0 siblings, 2 replies; 71+ messages in thread From: Jan Ziak @ 2022-03-12 0:01 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On Sat, Mar 12, 2022 at 12:39 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > On 2022/3/12 07:28, Jan Ziak wrote: > > On Sat, Mar 12, 2022 at 12:04 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > >> As stated before, autodefrag is not really that useful for database. > > > > Do you realize that you are claiming that btrfs autodefrag should not > > - by design - be effective in the case of high-fragmentation files? > > Unfortunately, that's exactly what I mean. > > We all know random writes would cause fragments, but autodefrag is not > like regular defrag ioctl, as it only scan newer extents. > > For example: > > Our autodefrag is required to defrag writes newer than gen 100, and our > inode has the following layout: > > |---Ext A---|--- Ext B---|---Ext C---|---Ext D---|---Ext E---| > Gen 50 Gen 101 Gen 49 Gen 30 Gen 30 > > Then autodefrag will only try to defrag extent B and extent C. > > Extent B meets the generation requirement, and is mergable with the next > extent C. > > But all the remaining extents A, D, E will not be defragged as their > generations don't meet the requirement. > > While for regular defrag ioctl, we don't have such generation > requirement, and is able to defrag all extents from A to E. > (But cause way more IO). > > Furthermore, autodefrag works by marking the target range dirty, and > wait for writeback (and hopefully get more writes near it, so it can get > even larger) > > But if the application, like the database, is calling fsync() > frequently, such re-dirtied range is going to writeback almost > immediately, without any further chance to get merged larger. So, basically, what you are saying is that you are refusing to work together towards fixing/improving the auto-defragmentation algorithm. Based on your decision in this matter, I am now forced either to find a replacement filesystem with features similar to btrfs or to implement a filesystem (where auto-defragmentation works correctly) myself. Since I failed to persuade you that there are serious errors/mistakes in the current btrfs-autodefrag implementation, this is my last email in this whole forum thread. Sincerely Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-12 0:01 ` Jan Ziak @ 2022-03-12 0:15 ` Qu Wenruo 2022-03-12 3:16 ` Zygo Blaxell 1 sibling, 0 replies; 71+ messages in thread From: Qu Wenruo @ 2022-03-12 0:15 UTC (permalink / raw) To: Jan Ziak; +Cc: linux-btrfs On 2022/3/12 08:01, Jan Ziak wrote: > On Sat, Mar 12, 2022 at 12:39 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >> On 2022/3/12 07:28, Jan Ziak wrote: >>> On Sat, Mar 12, 2022 at 12:04 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >>>> As stated before, autodefrag is not really that useful for database. >>> >>> Do you realize that you are claiming that btrfs autodefrag should not >>> - by design - be effective in the case of high-fragmentation files? >> >> Unfortunately, that's exactly what I mean. >> >> We all know random writes would cause fragments, but autodefrag is not >> like regular defrag ioctl, as it only scan newer extents. >> >> For example: >> >> Our autodefrag is required to defrag writes newer than gen 100, and our >> inode has the following layout: >> >> |---Ext A---|--- Ext B---|---Ext C---|---Ext D---|---Ext E---| >> Gen 50 Gen 101 Gen 49 Gen 30 Gen 30 >> >> Then autodefrag will only try to defrag extent B and extent C. >> >> Extent B meets the generation requirement, and is mergable with the next >> extent C. >> >> But all the remaining extents A, D, E will not be defragged as their >> generations don't meet the requirement. >> >> While for regular defrag ioctl, we don't have such generation >> requirement, and is able to defrag all extents from A to E. >> (But cause way more IO). >> >> Furthermore, autodefrag works by marking the target range dirty, and >> wait for writeback (and hopefully get more writes near it, so it can get >> even larger) >> >> But if the application, like the database, is calling fsync() >> frequently, such re-dirtied range is going to writeback almost >> immediately, without any further chance to get merged larger. > > So, basically, what you are saying is that you are refusing to work > together towards fixing/improving the auto-defragmentation algorithm. I'm explaining how autodefrag works, and work to improve autodefrag to handle the worst case scenario. If it doesn't fit your workload, that's unfortunate. There are always cases btrfs can't handle well. > > Based on your decision in this matter, I am now forced either to find > a replacement filesystem with features similar to btrfs or to > implement a filesystem (where auto-defragmentation works correctly) > myself. > > Since I failed to persuade you that there are serious errors/mistakes > in the current btrfs-autodefrag implementation, this is my last email > in this whole forum thread. > > Sincerely > Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-12 0:01 ` Jan Ziak 2022-03-12 0:15 ` Qu Wenruo @ 2022-03-12 3:16 ` Zygo Blaxell 1 sibling, 0 replies; 71+ messages in thread From: Zygo Blaxell @ 2022-03-12 3:16 UTC (permalink / raw) To: Jan Ziak; +Cc: Qu Wenruo, linux-btrfs On Sat, Mar 12, 2022 at 01:01:36AM +0100, Jan Ziak wrote: > On Sat, Mar 12, 2022 at 12:39 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > On 2022/3/12 07:28, Jan Ziak wrote: > > > On Sat, Mar 12, 2022 at 12:04 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > >> As stated before, autodefrag is not really that useful for database. > > > > > > Do you realize that you are claiming that btrfs autodefrag should not > > > - by design - be effective in the case of high-fragmentation files? > > > > Unfortunately, that's exactly what I mean. > > > > We all know random writes would cause fragments, but autodefrag is not > > like regular defrag ioctl, as it only scan newer extents. > > > > For example: > > > > Our autodefrag is required to defrag writes newer than gen 100, and our > > inode has the following layout: > > > > |---Ext A---|--- Ext B---|---Ext C---|---Ext D---|---Ext E---| > > Gen 50 Gen 101 Gen 49 Gen 30 Gen 30 > > > > Then autodefrag will only try to defrag extent B and extent C. > > > > Extent B meets the generation requirement, and is mergable with the next > > extent C. > > > > But all the remaining extents A, D, E will not be defragged as their > > generations don't meet the requirement. > > > > While for regular defrag ioctl, we don't have such generation > > requirement, and is able to defrag all extents from A to E. > > (But cause way more IO). > > > > Furthermore, autodefrag works by marking the target range dirty, and > > wait for writeback (and hopefully get more writes near it, so it can get > > even larger) > > > > But if the application, like the database, is calling fsync() > > frequently, such re-dirtied range is going to writeback almost > > immediately, without any further chance to get merged larger. > > So, basically, what you are saying is that you are refusing to work > together towards fixing/improving the auto-defragmentation algorithm. > > Based on your decision in this matter, I am now forced either to find > a replacement filesystem with features similar to btrfs or to > implement a filesystem (where auto-defragmentation works correctly) > myself. The second of those options is the TL;DR of my previous email, and you don't need to rewrite any part of btrfs except the autodefrag feature. I can answer questions to get you started. You will need to read up on: TREE_SEARCH_V2, the search ioctl. This gives you fast access to new extent refs. You'll need to decode them. The code in btrfs-progs for printing tree items is very useful to see how this is done. INO_PATHS, the resolve-inode-to-path-name ioctl. TREE_SEARCH_V2 will give you inode numbers, but DEFRAG_RANGE needs an open fd. This ioctl is the bridge between them. DEFRAG_RANGE, the defrag ioctl. This defrags a range of a file. The simple daemon model is: - track the filesystem transid every 30 seconds, sleep until it changes - use the TREE_SEARCH_V2 ioctl to find new extent references since the previous transid. See the 'btrfs sub find-new' implementation for details on extracting extent references and filtering by age. This has to be run on every subvol individually, but you can have a daemon for every subvol, or one process that runs this loops over all subvols. 
- examine extent references to see if they are good candidates for defrag: not too large or too small, no holes between, etc. This is a replica of the existing kernel algorithm. You can improve on this immediately by running new searches for neighboring extents within optimal defrag range without the transid filter. - ignore bad extent candidates - use INO_PATHS to retrieve the filenames of the inode containing the extent. You can improve on this by filtering filenames of files that are known to have extremely high update rates, or any other criteria that seem useful. - open the file using one of the names, and issue DEFRAG_RANGE to defragment the extents (a minimal C sketch of this call follows this message). If you store the last transid persistently (say in a /var file), you can run one iteration of the loop periodically during periods of low sensitivity to IO latency. It doesn't need to run continuously; you can start and stop it at any time depending on need. There are a few gotchas. The main one is that there's an upper bound on optimal extent size in btrfs, as well as a lower bound. Extents that are too large waste space because they cannot be deallocated until the last reference to the last block is overwritten or deleted. So you probably want to stop defragmenting once the extents are 256K or so on a database file, or it will waste a lot of space. Use lower values for heavily active files with random writes, higher values for infrequently modified files. Maximum extent size is 128K for a compressed extent, 128M for uncompressed. > Since I failed to persuade you that there are serious errors/mistakes > in the current btrfs-autodefrag implementation, this is my last email > in this whole forum thread. > Sincerely > Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
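A minimal sketch of the DEFRAG_RANGE step above, assuming a regular btrfs file the caller has write permission on. The struct and ioctl come from the linux/btrfs.h UAPI header; the 256K extent_thresh follows the gotcha above, and error handling is reduced to perror().

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

/* Defragment [start, start + len) of an open btrfs file, leaving alone any
 * extent that is already at least extent_thresh bytes long. */
static int defrag_range(int fd, __u64 start, __u64 len, __u32 extent_thresh)
{
    struct btrfs_ioctl_defrag_range_args args;

    memset(&args, 0, sizeof(args));
    args.start = start;
    args.len = len;
    args.extent_thresh = extent_thresh;
    /* args.flags could also request recompression (BTRFS_DEFRAG_RANGE_COMPRESS)
     * or immediate writeback (BTRFS_DEFRAG_RANGE_START_IO). */
    return ioctl(fd, BTRFS_IOC_DEFRAG_RANGE, &args);
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file-on-btrfs>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDWR);   /* needs write permission on the file */
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Whole file here; a daemon would pass the candidate range it found. */
    if (defrag_range(fd, 0, (__u64)-1, 256 * 1024) < 0) {
        perror("BTRFS_IOC_DEFRAG_RANGE");
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}

A daemon built around the loop above would call defrag_range() with the byte range of the candidate extents found by the tree search, instead of covering the whole file.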
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-11 23:28 ` Jan Ziak 2022-03-11 23:39 ` Qu Wenruo @ 2022-03-12 2:43 ` Zygo Blaxell 2022-03-12 3:24 ` Qu Wenruo 1 sibling, 1 reply; 71+ messages in thread From: Zygo Blaxell @ 2022-03-12 2:43 UTC (permalink / raw) To: Jan Ziak; +Cc: Qu Wenruo, linux-btrfs On Sat, Mar 12, 2022 at 12:28:10AM +0100, Jan Ziak wrote: > On Sat, Mar 12, 2022 at 12:04 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > As stated before, autodefrag is not really that useful for database. > > Do you realize that you are claiming that btrfs autodefrag should not > - by design - be effective in the case of high-fragmentation files? If > it isn't supposed to be useful for high-fragmentation files then where > is it supposed to be useful? Low-fragmentation files? IMHO it's best to deprecate the in-kernel autodefrag option, and start over with a better approach. The kernel is the wrong place to solve this problem, and the undesirable and unfixable things in autodefrag are a consequence of that early design error. As far as I can tell, in-kernel autodefrag's only purpose is to provide exposure to new and exciting bugs on each kernel release, and a lot of uncontrolled IO demands even when it's working perfectly. Inevitably, re-reading old fragments that are no longer in memory will consume RAM and iops during writeback activity, when memory and IO bandwidth is least available. If we avoid expensive re-reading of extents, then we don't get a useful rate of reduction of fragmentation, because we can't coalesce small new exists with small existing ones. If we try to fix these issues one at a time, the feature would inevitably grow a lot of complicated and brittle configuration knobs to turn it off selectively, because it's so awful without extensive filtering. All the above criticism applies to abstract ideal in-kernel autodefrag, _before_ considering whether a concrete implementation might have limitations or bugs which make it worse than the already-bad best case. 5.16 happened to have a lot of examples of these, but fixing the regressions can only restore autodefrag's relative harmlessness, not add utility within the constraints the kernel is under. The right place to do autodefrag is userspace. Interfaces already exist for userspace to 1) discover new extents and their neighbors, quickly and safely, across the entire filesystem; 2) invoke defrag_range on file extent ranges found in step 1; and 3) run a while (true) loop that periodically performs steps 1 and 2. Indeed, the existing kernel autodefrag implementation is already using the same back-end infrastructure for parts 1 and 2, so all that would be required for userspace is to reimplement (and start improving upon) part 3. A command-line utility or daemon can locate new extents immediately with tree_search queries, either at filesystem-wide scales, or directed at user-chosen file subsets. Tools can quickly assess whether new extents are good candidates for defrag, then coalesce them with their neighbors. The user can choose between different tools to decide basic policy questions like: whether to run once in a batch job or continuously in the background, what amounts of IO bandwidth and memory to consume, whether to recompress data with a more aggressive algorithm/level, which reference to a snapshot-shared extent should be preferred for defrag, file-type-specific layout optimizations to apply, or any custom or experimental selection, scheduling, or optimization logic desired. 
Implementations can be kept simple because it's not necessary for userspace tools to pile every possible option into a single implementation, and support every released option forever (as required for the kernel). A specialist implementation can discard existing code with impunity or start from scratch with an experimental algorithm, and spend its life in a fork of the main userspace autodefrag project with niche users who never have to cope with generic users' use cases and vice versa. This efficiently distributes development and maintenance costs. Userspace autodefrag can be implemented today in any programming language with btrfs ioctl support, and run on any kernel released in the last 6 years. Alas, I don't know of anybody who's released a userspace autodefrag tool yet, and it hasn't been important enough to me to build one myself (other than a few proof-of-concept prototypes). For now, I do defrag mostly ad-hoc with 'btrfs fi defrag' on the most severely fragmented files (top N list of files with the highest extent counts on the filesystem), and ignore fragmentation everywhere else. > -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
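For the ad-hoc "top N most fragmented files" approach mentioned at the end of the previous message, per-file extent counts can be gathered with the same FIEMAP ioctl that filefrag uses. A hedged sketch that only counts extents (finding and sorting the files is left to the caller):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>

/* Return the number of extents backing the file, or -1 on error.  With
 * fm_extent_count == 0, FIEMAP copies no extent records back to userspace
 * and only reports how many extents the mapping would need. */
static long count_extents(const char *path)
{
    struct fiemap fm = {
        .fm_start = 0,
        .fm_length = FIEMAP_MAX_OFFSET,   /* whole file */
        .fm_extent_count = 0,             /* just count */
    };
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) {
        close(fd);
        return -1;
    }
    close(fd);
    return (long)fm.fm_mapped_extents;
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++)
        printf("%ld\t%s\n", count_extents(argv[i]), argv[i]);
    return 0;
}

As the filefrag run at the start of this thread shows, even just enumerating extents can take a long time on a file with millions of them, so a practical tool would want to rate-limit these scans.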
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-12 2:43 ` Zygo Blaxell @ 2022-03-12 3:24 ` Qu Wenruo 2022-03-12 3:48 ` Zygo Blaxell 0 siblings, 1 reply; 71+ messages in thread From: Qu Wenruo @ 2022-03-12 3:24 UTC (permalink / raw) To: Zygo Blaxell, Jan Ziak; +Cc: linux-btrfs On 2022/3/12 10:43, Zygo Blaxell wrote: > On Sat, Mar 12, 2022 at 12:28:10AM +0100, Jan Ziak wrote: >> On Sat, Mar 12, 2022 at 12:04 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >>> As stated before, autodefrag is not really that useful for database. >> >> Do you realize that you are claiming that btrfs autodefrag should not >> - by design - be effective in the case of high-fragmentation files? If >> it isn't supposed to be useful for high-fragmentation files then where >> is it supposed to be useful? Low-fragmentation files? > > IMHO it's best to deprecate the in-kernel autodefrag option, and start > over with a better approach. The kernel is the wrong place to solve > this problem, and the undesirable and unfixable things in autodefrag > are a consequence of that early design error. I'm having the same feeling exactly. Especially the current autodefrag is putting its own policy (transid filter) without providing a mechanism to utilize from user space. Exactly the opposite what we should do, provide a mechanism not a policy. Not to mention there are quite some limitations of the current policy. But unfortunately, even we deprecate it right now, it will takes a long time to really remove it from kernel. While on the other hand, we also need to introduce new parameters like @newer_than, and @max_to_defrag to the ioctl interface. Which may already eat up the unused bytes (only 16 bytes, while newer_than needs u64, max_to_defrag may also need to be u64). And user space tool lacks one of the critical info, where the small writes are. So even I can't be more happier to deprecate the autodefrag, we still need to hang on it for a pretty lone time, before a user space tool which can do everything the same as autodefrag. Thanks, Qu > > As far as I can tell, in-kernel autodefrag's only purpose is to provide > exposure to new and exciting bugs on each kernel release, and a lot of > uncontrolled IO demands even when it's working perfectly. Inevitably, > re-reading old fragments that are no longer in memory will consume RAM > and iops during writeback activity, when memory and IO bandwidth is least > available. If we avoid expensive re-reading of extents, then we don't > get a useful rate of reduction of fragmentation, because we can't coalesce > small new exists with small existing ones. If we try to fix these issues > one at a time, the feature would inevitably grow a lot of complicated > and brittle configuration knobs to turn it off selectively, because it's > so awful without extensive filtering. > > All the above criticism applies to abstract ideal in-kernel autodefrag, > _before_ considering whether a concrete implementation might have > limitations or bugs which make it worse than the already-bad best case. > 5.16 happened to have a lot of examples of these, but fixing the > regressions can only restore autodefrag's relative harmlessness, not > add utility within the constraints the kernel is under. > > The right place to do autodefrag is userspace. 
Interfaces already > exist for userspace to 1) discover new extents and their neighbors, > quickly and safely, across the entire filesystem; 2) invoke defrag_range > on file extent ranges found in step 1; and 3) run a while (true) > loop that periodically performs steps 1 and 2. Indeed, the existing > kernel autodefrag implementation is already using the same back-end > infrastructure for parts 1 and 2, so all that would be required for > userspace is to reimplement (and start improving upon) part 3. > > A command-line utility or daemon can locate new extents immediately with > tree_search queries, either at filesystem-wide scales, or directed at > user-chosen file subsets. Tools can quickly assess whether new extents > are good candidates for defrag, then coalesce them with their neighbors. > > The user can choose between different tools to decide basic policy > questions like: whether to run once in a batch job or continuously in > the background, what amounts of IO bandwidth and memory to consume, > whether to recompress data with a more aggressive algorithm/level, which > reference to a snapshot-shared extent should be preferred for defrag, > file-type-specific layout optimizations to apply, or any custom or > experimental selection, scheduling, or optimization logic desired. > > Implementations can be kept simple because it's not necessary for > userspace tools to pile every possible option into a single implementation, > and support every released option forever (as required for the kernel). > A specialist implementation can discard existing code with impunity or > start from scratch with an experimental algorithm, and spend its life > in a fork of the main userspace autodefrag project with niche users > who never have to cope with generic users' use cases and vice versa. > This efficiently distributes development and maintenance costs. > > Userspace autodefrag can be implemented today in any programming language > with btrfs ioctl support, and run on any kernel released in the last > 6 years. Alas, I don't know of anybody who's released a userspace > autodefrag tool yet, and it hasn't been important enough to me to build > one myself (other than a few proof-of-concept prototypes). > > For now, I do defrag mostly ad-hoc with 'btrfs fi defrag' on the most > severely fragmented files (top N list of files with the highest extent > counts on the filesystem), and ignore fragmentation everywhere else. > > >> -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-12 3:24 ` Qu Wenruo @ 2022-03-12 3:48 ` Zygo Blaxell 0 siblings, 0 replies; 71+ messages in thread From: Zygo Blaxell @ 2022-03-12 3:48 UTC (permalink / raw) To: Qu Wenruo; +Cc: Jan Ziak, linux-btrfs On Sat, Mar 12, 2022 at 11:24:18AM +0800, Qu Wenruo wrote: > > > On 2022/3/12 10:43, Zygo Blaxell wrote: > > On Sat, Mar 12, 2022 at 12:28:10AM +0100, Jan Ziak wrote: > > > On Sat, Mar 12, 2022 at 12:04 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > > > As stated before, autodefrag is not really that useful for database. > > > > > > Do you realize that you are claiming that btrfs autodefrag should not > > > - by design - be effective in the case of high-fragmentation files? If > > > it isn't supposed to be useful for high-fragmentation files then where > > > is it supposed to be useful? Low-fragmentation files? > > > > IMHO it's best to deprecate the in-kernel autodefrag option, and start > > over with a better approach. The kernel is the wrong place to solve > > this problem, and the undesirable and unfixable things in autodefrag > > are a consequence of that early design error. > > I'm having the same feeling exactly. > > Especially the current autodefrag is putting its own policy (transid > filter) without providing a mechanism to utilize from user space. > > Exactly the opposite what we should do, provide a mechanism not a policy. > > Not to mention there are quite some limitations of the current policy. > > > But unfortunately, even we deprecate it right now, it will takes a long > time to really remove it from kernel. Agree that we have to keep it around until everyone has moved over to the new thing; however, we can stop developing the old thing much sooner, and work on the new thing immediately. > While on the other hand, we also need to introduce new parameters like > @newer_than, and @max_to_defrag to the ioctl interface. > > Which may already eat up the unused bytes (only 16 bytes, while > newer_than needs u64, max_to_defrag may also need to be u64). > > And user space tool lacks one of the critical info, where the small > writes are. Userspace can find new extents pretty fast if it's keeping up with writes in real time. bees scans do a search for all new extents in the last 30 seconds (not just the small ones) and finish in tenths of milliseconds with a hot cache. This is orders of magnitude faster than the actual defragmentation, which has to do all the data IO twice, copy all the modified metadata pages, delayed extent refs, and pay the seek costs for re-reading the fragmented data and writing it somewhere else. The kernel could maintain the list of autodefrag inodes and simply provide them to userspace on demand, but honestly I don't think this list is worth even the tiny amount of memory that it uses. > So even I can't be more happier to deprecate the autodefrag, we still > need to hang on it for a pretty lone time, before a user space tool > which can do everything the same as autodefrag. > > Thanks, > Qu > > > > > As far as I can tell, in-kernel autodefrag's only purpose is to provide > > exposure to new and exciting bugs on each kernel release, and a lot of > > uncontrolled IO demands even when it's working perfectly. Inevitably, > > re-reading old fragments that are no longer in memory will consume RAM > > and iops during writeback activity, when memory and IO bandwidth is least > > available. 
If we avoid expensive re-reading of extents, then we don't > > get a useful rate of reduction of fragmentation, because we can't coalesce > > small new exists with small existing ones. If we try to fix these issues > > one at a time, the feature would inevitably grow a lot of complicated > > and brittle configuration knobs to turn it off selectively, because it's > > so awful without extensive filtering. > > > > All the above criticism applies to abstract ideal in-kernel autodefrag, > > _before_ considering whether a concrete implementation might have > > limitations or bugs which make it worse than the already-bad best case. > > 5.16 happened to have a lot of examples of these, but fixing the > > regressions can only restore autodefrag's relative harmlessness, not > > add utility within the constraints the kernel is under. > > > > The right place to do autodefrag is userspace. Interfaces already > > exist for userspace to 1) discover new extents and their neighbors, > > quickly and safely, across the entire filesystem; 2) invoke defrag_range > > on file extent ranges found in step 1; and 3) run a while (true) > > loop that periodically performs steps 1 and 2. Indeed, the existing > > kernel autodefrag implementation is already using the same back-end > > infrastructure for parts 1 and 2, so all that would be required for > > userspace is to reimplement (and start improving upon) part 3. > > > > A command-line utility or daemon can locate new extents immediately with > > tree_search queries, either at filesystem-wide scales, or directed at > > user-chosen file subsets. Tools can quickly assess whether new extents > > are good candidates for defrag, then coalesce them with their neighbors. > > > > The user can choose between different tools to decide basic policy > > questions like: whether to run once in a batch job or continuously in > > the background, what amounts of IO bandwidth and memory to consume, > > whether to recompress data with a more aggressive algorithm/level, which > > reference to a snapshot-shared extent should be preferred for defrag, > > file-type-specific layout optimizations to apply, or any custom or > > experimental selection, scheduling, or optimization logic desired. > > > > Implementations can be kept simple because it's not necessary for > > userspace tools to pile every possible option into a single implementation, > > and support every released option forever (as required for the kernel). > > A specialist implementation can discard existing code with impunity or > > start from scratch with an experimental algorithm, and spend its life > > in a fork of the main userspace autodefrag project with niche users > > who never have to cope with generic users' use cases and vice versa. > > This efficiently distributes development and maintenance costs. > > > > Userspace autodefrag can be implemented today in any programming language > > with btrfs ioctl support, and run on any kernel released in the last > > 6 years. Alas, I don't know of anybody who's released a userspace > > autodefrag tool yet, and it hasn't been important enough to me to build > > one myself (other than a few proof-of-concept prototypes). > > > > For now, I do defrag mostly ad-hoc with 'btrfs fi defrag' on the most > > severely fragmented files (top N list of files with the highest extent > > counts on the filesystem), and ignore fragmentation everywhere else. > > > > > > > -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-11 2:59 ` Qu Wenruo 2022-03-11 5:04 ` Jan Ziak @ 2022-03-14 20:09 ` Phillip Susi 2022-03-14 22:59 ` Zygo Blaxell 1 sibling, 1 reply; 71+ messages in thread From: Phillip Susi @ 2022-03-14 20:09 UTC (permalink / raw) To: Qu Wenruo; +Cc: Jan Ziak, linux-btrfs Qu Wenruo <quwenruo.btrfs@gmx.com> writes: > That's more or less expected. > > Autodefrag has two limitations: > > 1. Only defrag newer writes > It doesn't defrag older fragments. > This is the existing behavior from the beginning of autodefrag. > Thus it's not that effective against small random writes. I don't understand this bit. The whole point of defrag is to reduce the fragmentation of previous writes. New writes should always attempt to follow the previous one if possible. If auto defrag only changes the behavior of new writes, then how does it change it and why is that not the way new writes are always done? ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-14 20:09 ` Phillip Susi @ 2022-03-14 22:59 ` Zygo Blaxell 2022-03-15 18:28 ` Phillip Susi 2022-03-20 17:50 ` Forza 0 siblings, 2 replies; 71+ messages in thread From: Zygo Blaxell @ 2022-03-14 22:59 UTC (permalink / raw) To: Phillip Susi; +Cc: Qu Wenruo, Jan Ziak, linux-btrfs On Mon, Mar 14, 2022 at 04:09:08PM -0400, Phillip Susi wrote: > > Qu Wenruo <quwenruo.btrfs@gmx.com> writes: > > > That's more or less expected. > > > > Autodefrag has two limitations: > > > > 1. Only defrag newer writes > > It doesn't defrag older fragments. > > This is the existing behavior from the beginning of autodefrag. > > Thus it's not that effective against small random writes. > > I don't understand this bit. The whole point of defrag is to reduce the > fragmentation of previous writes. New writes should always attempt to > follow the previous one if possible. New writes are allocated to the first available free space hole large enough to hold them, starting from the point of the last write (plus some other details like clustering and alignment). The goal is that data writes from memory are sequential as much as possible, even if many different files were written in the same transaction. btrfs extents are immutable, so the filesystem can't extend an existing extent with new data. Instead, a new extent must be created that contains both the old and new data to replace the old extent. At least one new fragment must be created whenever the filesystem is modified. (In zoned mode, this is strictly enforced by the underlying hardware.) > If auto defrag only changes the > behavior of new writes, then how does it change it and why is that not > the way new writes are always done? Autodefrag doesn't change write behavior directly. It is a post-processing thread that rereads and rewrites recently written data, _after_ it was originally written to disk. In theory, running defrag after the writes means that the writes can be fast for low latency--they are a physically sequential stream of blocks sent to the disk as fast as it can write them, because btrfs does not have to be concerned with trying to achieve physical contiguity of logically discontiguous data. Later on, when latency is no longer an issue and some IO bandwidth is available, the fragments can be reread and collected together into larger logically and physically contiguous extents by a background process. In practice, autodefrag does only part of that task, badly. Say we have a program that writes 4K to the end of a file, every 5 seconds, for 5 minutes. Every 30 seconds (default commit interval), kernel writeback submits all the dirty pages for writing to btrfs, and in 30 seconds there will be 6 x 4K = 24K of those. An extent in btrfs is created to hold the pages, filled with the data blocks, connected to the various filesystem trees, and flushed out to disk. Over 5 minutes this will happen 10 times, so the file contains 10 fragments, each about 24K (commits are asynchronous, so it might be 20K in one fragment and 28K in the next). After each commit, inodes with new extents are appended to a list in memory. Each list entry contains an inode, a transid of the commit where the first write occurred, and the last defrag offset. That list is processed by a kernel thread some time after the commits are written to disk. The thread searches the inodes for extents created after the last defrag transid, invokes defrag_range on each of these, and advances the offset. 
If the search offset reaches the end of file, then it is reset to the beginning and another loop is done, and if the next search loop over the file doesn't find new extents then the inode is removed from the defrag list. If there's a 5 minute delay between the original writes and autodefrag finally catching up, then autodefrag will detect 10 new extents and run defrag_range over them. This is a read-then-write operation, since the extent blocks may no longer be present in memory after writeback, so autodefrag can easily fall behind writes if there are a lot of them. Also the 64K size limit kicks in, so it might write 5 extents (2 x 24K = 48K, but 3 x 24K = 72K, and autodefrag cuts off at 64K). If there's a 1 minute delay between the original writes and autodefrag, then autodefrag will detect 1 new extents and run defrag over them for a total of 5 new extents, about 240K each. If there's no delay at all, then there will be 10 extents of 120K each--if autodefrag runs immediately after commit, it will see only one extent in each loop, and issue no defrag_range calls. Seen from the point of view of the disk, there are always at least 10x 120K writes. In the no-autodefrag case it ends there. In the autodefrag cases, some of the data is read and rewritten later to make larger extents. In non-appending cases, the kernel autodefrag doesn't do very much useful at all--random writes aren't logically contiguous, so autodefrag never sees two adjacent extents in a search result, and therefore never sees an opportunity to defrag anything. At the time autodefrag was added to the kernel (May 2011), it was already possible to do a better job in userspace for over a year (Feb 2010). Between 2012 and 2021 there are only a handful of bug fixes, mostly of the form "stop autodefrag from ruining things for the rest of the kernel." ^ permalink raw reply [flat|nested] 71+ messages in thread
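A purely illustrative toy model of the bookkeeping described above. The names are invented, and the real kernel processes batches of extents per pass rather than one, but it shows the resume offset, the wrap at EOF, and the removal of an inode once a full pass finds nothing new:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Invented stand-in for the per-inode list entry: the transid cutoff
 * recorded when the inode was queued, and where the last pass stopped. */
struct defrag_entry {
    uint64_t newer_than;
    uint64_t offset;
};

struct extent {
    uint64_t offset, len, gen;
};

/* One simplified pass: find the next extent at or after the resume offset
 * that is newer than the cutoff, "defrag" it, and advance.  Wrap to offset 0
 * at EOF; a full pass that finds nothing means the entry can be dropped. */
static bool one_pass(struct defrag_entry *de, struct extent *ext, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (ext[i].offset < de->offset || ext[i].gen <= de->newer_than)
            continue;
        printf("defrag_range(off=%llu, len=%llu)\n",
               (unsigned long long)ext[i].offset,
               (unsigned long long)ext[i].len);
        ext[i].gen = de->newer_than;   /* toy stand-in for "rewritten, no longer new" */
        de->offset = ext[i].offset + ext[i].len;
        return true;                   /* found work: keep the entry */
    }
    if (de->offset == 0)
        return false;                  /* full pass found nothing: drop the entry */
    de->offset = 0;                    /* reached EOF: wrap and check once more */
    return true;
}

int main(void)
{
    /* Ten 24K appends, as in the example above, all written after transid 100. */
    struct extent ext[10];
    for (size_t i = 0; i < 10; i++)
        ext[i] = (struct extent){ .offset = i * 24576, .len = 24576, .gen = 101 + i };

    struct defrag_entry de = { .newer_than = 100, .offset = 0 };
    while (one_pass(&de, ext, 10))
        ;   /* runs until a full pass finds no new extents */
    return 0;
}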
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-14 22:59 ` Zygo Blaxell @ 2022-03-15 18:28 ` Phillip Susi 2022-03-15 19:28 ` Jan Ziak 2022-03-15 21:06 ` Zygo Blaxell 2022-03-20 17:50 ` Forza 1 sibling, 2 replies; 71+ messages in thread From: Phillip Susi @ 2022-03-15 18:28 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Qu Wenruo, Jan Ziak, linux-btrfs Zygo Blaxell <ce3g8jdj@umail.furryterror.org> writes: > btrfs extents are immutable, so the filesystem can't extend an existing > extent with new data. Instead, a new extent must be created that contains > both the old and new data to replace the old extent. At least one new Wait, what? How is an extent immutable? Why isn't a new tree written out with a larger extent and once the transaction commits, bam... you've enlarged your extent? Just like modifying any other data. And do you mean to say that before the new data can be written, the old data must first be read in and moved to the new extent? That seems horridly inefficient. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-15 18:28 ` Phillip Susi @ 2022-03-15 19:28 ` Jan Ziak 0 siblings, 0 replies; 71+ messages in thread From: Jan Ziak @ 2022-03-15 19:28 UTC (permalink / raw) To: Phillip Susi; +Cc: Zygo Blaxell, Qu Wenruo, linux-btrfs On Tue, Mar 15, 2022 at 7:34 PM Phillip Susi <phill@thesusis.net> wrote: > Zygo Blaxell <ce3g8jdj@umail.furryterror.org> writes: > > > btrfs extents are immutable, so the filesystem can't extend an existing > > extent with new data. Instead, a new extent must be created that contains > > both the old and new data to replace the old extent. At least one new > > Wait, what? How is an extent immutable? Why isn't a new tree written > out with a larger extent and once the transaction commits, bam... you've > enlarged your extent? Just like modifying any other data. > > And do you mean to say that before the new data can be written, the old > data must first be read in and moved to the new extent? That seems > horridly inefficient. I think one way to make sense of this is that, in btrfs, not only is past file data immutable (because it is a CoW filesystem), but certain parts of the filesystem's metadata (such as extents) are immutable as well. Modifying metadata belonging to a previous (and thus, by design, unmodifiable) generation in the btrfs filesystem is somewhat complicated. -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-15 18:28 ` Phillip Susi 2022-03-15 19:28 ` Jan Ziak @ 2022-03-15 21:06 ` Zygo Blaxell 2022-03-15 22:20 ` Jan Ziak 2022-03-16 18:46 ` Phillip Susi 1 sibling, 2 replies; 71+ messages in thread From: Zygo Blaxell @ 2022-03-15 21:06 UTC (permalink / raw) To: Phillip Susi; +Cc: Qu Wenruo, Jan Ziak, linux-btrfs On Tue, Mar 15, 2022 at 02:28jjjZ:46PM -0400, Phillip Susi wrote: > Zygo Blaxell <ce3g8jdj@umail.furryterror.org> writes: > > > btrfs extents are immutable, so the filesystem can't extend an existing > > extent with new data. Instead, a new extent must be created that contains > > both the old and new data to replace the old extent. At least one new > > Wait, what? How is an extent immutable? Why isn't a new tree written > out with a larger extent and once the transaction commits, bam... you've > enlarged your extent? Just like modifying any other data. If the extent is compressed, you have to write a new extent, because there's no other way to atomically update a compressed extent. If it's reflinked or snapshotted, you can't overwrite the data in place as long as a second reference to the data exists. This is what makes nodatacow and prealloc slow--on every write, they have to check whether the blocks being written are shared or not, and that check is expensive because it's a linear search of every reference for overlapping block ranges, and it can't exit the search early until it has proven there are no shared references. Contrast with datacow, which allocates a new unshared extent that it knows it can write to, and only has to check overwritten extents when they are completely overwritten (and only has to check for the existence of one reference, not enumerate them all). When a file refers to an extent, it refers to the entire extent from the file's subvol tree, even if only a single byte of the extent is contained in the file. There's no mechanism in btrfs extent tree v1 for atomically replacing an extent with separately referenceable objects, and updating all the pointers to parts of the old object to point to the new one. Any such update could cascade into updates across all reflinks and snapshots of the extent, so the write multiplier can be arbitrarily large. There is an extent tree v2 project which provides for splitting uncompressed extents (compressed extents are always immutable) by storing all the overlapping references as objects in the extent tree. It does reference tracking by creating an extent item for every referenced block range, so changing one reference's position or length (e.g. by overwriting or deleting part of an extent reference in a file) doesn't affect any other reference. In theory it could also append to the end of an existing extent, if that case ever came up. That brings us to the next problem: mutable extents won't help with the appending case without also teaching the allocator how to spread out files all over the disk so there's physical space available at file EOF. Normally in btrfs, if you write to 3 files, whatever you wrote is packed into 3 physically contiguous and adjacent extents. If you then want to append to the first or second file, you'll need a new extent, because there's no physical space between the files. > And do you mean to say that before the new data can be written, the old > data must first be read in and moved to the new extent? That seems > horridly inefficient. Normally btrfs doesn't read anything when it writes. 
New writes create new extents for the new data, and delete only extents that are completely replaced by the new extents. A series of sequential small writes creates a lot of small extents, and small extents are sometimes undesirable. Defrag gathers these small extents when they are logically adjacent, reads them into memory, writes a new physically contiguous extent to replace them, then deletes the old extents. Autodefrag is a process that makes defrag happen soon after the extents are written. Defrag isn't the only way to resolve the small-extents issue. If the file is only read once (e.g. a log file that is rotated and compressed with a high-performance compressor like xz) then defrag is a waste of read/write cycles--it's better to leave the small fragments where they are until they are deleted by an application. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-15 21:06 ` Zygo Blaxell @ 2022-03-15 22:20 ` Jan Ziak 2022-03-16 17:02 ` Zygo Blaxell 2022-03-16 18:46 ` Phillip Susi 1 sibling, 1 reply; 71+ messages in thread From: Jan Ziak @ 2022-03-15 22:20 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Phillip Susi, Qu Wenruo, linux-btrfs On Tue, Mar 15, 2022 at 10:06 PM Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: > This is what makes > nodatacow and prealloc slow--on every write, they have to check whether > the blocks being written are shared or not, and that check is expensive > because it's a linear search of every reference for overlapping block > ranges, and it can't exit the search early until it has proven there > are no shared references. Contrast with datacow, which allocates a new > unshared extent that it knows it can write to, and only has to check > overwritten extents when they are completely overwritten (and only has > to check for the existence of one reference, not enumerate them all). Some questions: - Linear nodatacow search: Do you mean that write(fd1, buf1, 4096) to a larger nodatacow file is slower compared to write(fd2, buf2, 4096) to a smaller nodatacow file? - Linear nodatacow search: Does the search happen only with uncached metadata, or also with metadata cached in RAM? - Extent tree v2 + nodatacow: V2 also features the linear search (like v1) or has the search been redesigned to be logarithmic? -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-15 22:20 ` Jan Ziak @ 2022-03-16 17:02 ` Zygo Blaxell 2022-03-16 17:48 ` Jan Ziak 0 siblings, 1 reply; 71+ messages in thread From: Zygo Blaxell @ 2022-03-16 17:02 UTC (permalink / raw) To: Jan Ziak; +Cc: Phillip Susi, Qu Wenruo, linux-btrfs On Tue, Mar 15, 2022 at 11:20:09PM +0100, Jan Ziak wrote: > On Tue, Mar 15, 2022 at 10:06 PM Zygo Blaxell > <ce3g8jdj@umail.furryterror.org> wrote: > > This is what makes > > nodatacow and prealloc slow--on every write, they have to check whether > > the blocks being written are shared or not, and that check is expensive > > because it's a linear search of every reference for overlapping block > > ranges, and it can't exit the search early until it has proven there > > are no shared references. Contrast with datacow, which allocates a new > > unshared extent that it knows it can write to, and only has to check > > overwritten extents when they are completely overwritten (and only has > > to check for the existence of one reference, not enumerate them all). > > Some questions: > > - Linear nodatacow search: Do you mean that write(fd1, buf1, 4096) to > a larger nodatacow file is slower compared to write(fd2, buf2, 4096) > to a smaller nodatacow file? Size doesn't matter, the number and position of references do. It's true that large extents tend to end up with higher average reference counts than small extents, but that's only spurious correlation--the "large extent" and "many references" cases are independent. An 8K nodatacow extent, where the first 4K block has exactly one reference and the second 4K has 32767 references, requires a 32768 times more CPU work to write than a 128M extent with a single reference. In sane cases, there's only one reference to a nodatacow/prealloc extent, because multiple references will turn off nodatacow and multiple writes will turn off prealloc, defeating both features. When there's only one reference, the linear search for overlapping blocks ends quickly. In insane cases (after hole punching, snapshots, reflinks, or writes to prealloc files) there exist multiple references to the extent, each covering distinct byte ranges of the extent. The btrfs trees only index references from leaf metadata pages to the entire extent, so to calculate the number of times an individual block is referenced, we have to iterate over every existing reference to see if it happens to overlap the blocks of interest. That's O(N) in the number of references (roughly--e.g. we don't need to examine different snapshots sharing a metadata page, because every snapshot sharing a metadata page references the same bytes in the data extent, but I don't know if btrfs implements that optimization). We can't simply read the reference count on the extent for various reasons. One is that we don't know what the true reference count is without walking all parent tree nodes toward the root to see if there's a snapshot. The extent is referenced by one metadata page, so its reference count is 1, but the metadata page is shared by multiple tree roots, so the true reference count is higher. Another is that a hole punched into the middle of an extent causes two references from the same file, where each reference covers a distinct set of blocks. None of the individual blocks are shared, but the extent's reference count is 2. > - Linear nodatacow search: Does the search happen only with uncached > metadata, or also with metadata cached in RAM? 
All metadata is cached in RAM prior to searching. I think I missed where you were going with this question. > - Extent tree v2 + nodatacow: Does v2 also feature the linear search (like > v1), or has the search been redesigned to be logarithmic? I haven't seen the implementation, but the design implies a linear search over the adjacent range of extent physical addresses that is up to 2 * max_extent_len wide. It could be made faster with a clever data structure, which is implied in the project description, but I haven't seen details. There are simple ways to make nodatacow fast, but btrfs doesn't implement them. e.g. nodatacow could be a subvol property, where reflink and snapshot are prohibited over the entire subvol when nodatacow is enabled. That would eliminate the need to ever search extent references on write--nodatacow writes could safely assume everything in the subvol is never shared--and it would match the expectations of people who prefer that nodatacow takes precedence over all incompatible btrfs features. > -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
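For concreteness, one way a nodatacow extent picks up a second reference in the first place is a reflink copy; a rough sketch (file names and sizes are made up, not from the thread). Once the reflink exists, every overwrite of the shared range has to do the reference walk described above.

touch nocow-test
chattr +C nocow-test                        # nodatacow applies only to data written afterwards
dd if=/dev/zero of=nocow-test bs=1M count=8 conv=notrunc
sync
cp --reflink=always nocow-test nocow-copy   # second reference to the same nodatacow extent
sync
compsize nocow-test nocow-copy              # Referenced ~16M vs Disk Usage ~8M: the extent is now shared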
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-16 17:02 ` Zygo Blaxell @ 2022-03-16 17:48 ` Jan Ziak 2022-03-17 2:11 ` Zygo Blaxell 0 siblings, 1 reply; 71+ messages in thread From: Jan Ziak @ 2022-03-16 17:48 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Phillip Susi, Qu Wenruo, linux-btrfs On Wed, Mar 16, 2022 at 6:02 PM Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: > On Tue, Mar 15, 2022 at 11:20:09PM +0100, Jan Ziak wrote: > > - Linear nodatacow search: Does the search happen only with uncached > > metadata, or also with metadata cached in RAM? > > All metadata is cached in RAM prior to searching. I think I missed > where you were going with this question. The idea behind the question was whether the on-disk format of metadata differs from the in-memory format of metadata; whether metadata is being transformed when loading/saving it from/to the storage device. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-16 17:48 ` Jan Ziak @ 2022-03-17 2:11 ` Zygo Blaxell 0 siblings, 0 replies; 71+ messages in thread From: Zygo Blaxell @ 2022-03-17 2:11 UTC (permalink / raw) To: Jan Ziak; +Cc: Phillip Susi, Qu Wenruo, linux-btrfs On Wed, Mar 16, 2022 at 06:48:04PM +0100, Jan Ziak wrote: > On Wed, Mar 16, 2022 at 6:02 PM Zygo Blaxell > <ce3g8jdj@umail.furryterror.org> wrote: > > On Tue, Mar 15, 2022 at 11:20:09PM +0100, Jan Ziak wrote: > > > - Linear nodatacow search: Does the search happen only with uncached > > > metadata, or also with metadata cached in RAM? > > > > All metadata is cached in RAM prior to searching. I think I missed > > where you were going with this question. > > The idea behind the question was whether the on-disk format of > metadata differs from the in-memory format of metadata; whether > metadata is being transformed when loading/saving it from/to the > storage device. Both things happen. Metadata reference updates are handled by delayed refs, which track pending reference updates (mostly with the hope of eliminating them entirely, as increment/decrement pairs are common). If these don't cancel out by the end of a transaction, they are turned into metadata page updates. Metadata searches use tree mod log, which is an in-memory version of the history of metadata updates so far in the transaction, since the metadata page buffers themselves will be out of date. Anything not in those caches (including everything committed to disk) is in metadata buffers which are memory buffers in on-disk format. There is a backref cache which is used for relocation, but not the nodatacow/prealloc cases (or normal deletes). Caching doesn't really work for the writing cases since the metadata is changing under the cache. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-15 21:06 ` Zygo Blaxell 2022-03-15 22:20 ` Jan Ziak @ 2022-03-16 18:46 ` Phillip Susi 2022-03-16 19:59 ` Zygo Blaxell 1 sibling, 1 reply; 71+ messages in thread From: Phillip Susi @ 2022-03-16 18:46 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Qu Wenruo, Jan Ziak, linux-btrfs Zygo Blaxell <ce3g8jdj@umail.furryterror.org> writes: > If the extent is compressed, you have to write a new extent, because > there's no other way to atomically update a compressed extent. Right, that makes sense for compression. > If it's reflinked or snapshotted, you can't overwrite the data in place > as long as a second reference to the data exists. This is what makes > nodatacow and prealloc slow--on every write, they have to check whether > the blocks being written are shared or not, and that check is expensive > because it's a linear search of every reference for overlapping block > ranges, and it can't exit the search early until it has proven there > are no shared references. Contrast with datacow, which allocates a new > unshared extent that it knows it can write to, and only has to check > overwritten extents when they are completely overwritten (and only has > to check for the existence of one reference, not enumerate them all). Right, I know you can't overwrite the data in place. What I'm not understanding is why you can't just write the new data elsewhere and then free the no longer used portion of the old extent. > When a file refers to an extent, it refers to the entire extent from the > file's subvol tree, even if only a single byte of the extent is contained > in the file. There's no mechanism in btrfs extent tree v1 for atomically > replacing an extent with separately referenceable objects, and updating > all the pointers to parts of the old object to point to the new one. > Any such update could cascade into updates across all reflinks and > snapshots of the extent, so the write multiplier can be arbitrarily large. So the inode in the subvol tree points to an extent in the extent tree, and then the extent points to the space on disk? And only one extent in the extent tree can ever point to a given location on disk? In other words, if file B is a reflink copy of file A, and you update one page in file B, it can't just create 3 new extents in the extent tree: one that refers to the first part of the original extent, one that refers to the last part of the original extent, and one for the new location of the new data? Instead file B refers to the original extent, and to one new extent, in such a way that the second supersedes part of the first only for file B? ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-16 18:46 ` Phillip Susi @ 2022-03-16 19:59 ` Zygo Blaxell 0 siblings, 0 replies; 71+ messages in thread From: Zygo Blaxell @ 2022-03-16 19:59 UTC (permalink / raw) To: Phillip Susi; +Cc: Qu Wenruo, Jan Ziak, linux-btrfs On Wed, Mar 16, 2022 at 02:46:33PM -0400, Phillip Susi wrote: > > Zygo Blaxell <ce3g8jdj@umail.furryterror.org> writes: > > > If the extent is compressed, you have to write a new extent, because > > there's no other way to atomically update a compressed extent. > > Right, that makes sense for compression. > > > If it's reflinked or snapshotted, you can't overwrite the data in place > > as long as a second reference to the data exists. This is what makes > > nodatacow and prealloc slow--on every write, they have to check whether > > the blocks being written are shared or not, and that check is expensive > > because it's a linear search of every reference for overlapping block > > ranges, and it can't exit the search early until it has proven there > > are no shared references. Contrast with datacow, which allocates a new > > unshared extent that it knows it can write to, and only has to check > > overwritten extents when they are completely overwritten (and only has > > to check for the existence of one reference, not enumerate them all). > > Right, I know you can't overwrite the data in place. What I'm not > understanding is why you can't just just write the new data elsewhere > and then free the no longer used portion of the old extent. > > > When a file refers to an extent, it refers to the entire extent from the > > file's subvol tree, even if only a single byte of the extent is contained > > in the file. There's no mechanism in btrfs extent tree v1 for atomically > > replacing an extent with separately referenceable objects, and updating > > all the pointers to parts of the old object to point to the new one. > > Any such update could cascade into updates across all reflinks and > > snapshots of the extent, so the write multiplier can be arbitrarily large. > > So the inode in the subvol tree points to an extent in the extent tree, > and then the extent points to the space on disk? The extent item tracks ownership of the space on disk. The extent item key _is_ the location on disk, so there's no need for a pointer in the item itself (e.g. read doesn't bother with the extent tree, it just goes straight from the inode ref to the data blocks and csums). The extent tree only comes up to resolve ownership issues, like whether the last reference to an extent has been removed, or a new reference added, or whether multiple references to the extent exist. > And only one extent in > the extent tree can ever point to a given location on disk? Correct. That restriction is characteristic of extent tree v1. Each extent maintains a list of references to itself. The extent is the exclusive owner of the physical space, and ownership of the extent item is shared by multiple inode references. Each inode reference knows which bytes of the extent it is referring to, but this information is scattered over the subvol trees and not available in the extent tree. Extent tree v2 creates a separate extent object in the extent tree for each reflink, and allows the physical regions covered by each extent to overlap. The inode reference is the exclusive owner of the extent item, and ownership of the physical space is shared by multiple extents. 
The extent tree in v2 tracks which inodes refer to which specific blocks, so the availability of a block can be computed without referring to any other trees. In v2, free space is recalculated when an extent is removed. The nearby extent tree is searched to see if any blocks no longer overlap with an extent, and any such blocks are added to free space. To me it looks like that free space search is O(N), since there's no proposed data structure to make it not a linear search of every possibly-overlapping extent item (all extents within MAX_EXTENT_SIZE bytes from the point where space was freed). The v2 proposal also has a deferred GC worker, so maybe the O(N) searches will be performed in a background thread where they aren't as time-sensitive, and maybe the search cost can be amortized over multiple deletions near the same physical position. Deferred GC doesn't help nodatacow or prealloc though, which have to know whether a block is shared during the write operation, and can't wait until later. > In other words, if file B is a reflink copy of file A, and you update > one page in file B, it can't just create 3 new extents in the extent > tree: one that refers to the firt part of the original extent, one that > refers to the last part of the original extent, and one for the new > location of the new data? Instead file B refers to the original extent, > and to one new extent, in such a way that the second superceeds part of > the first only for file B? Correct. Changing an extent in tree v1 requires updating every reference to the extent, because any inode referring to the entire extent will now need to refer to 3 distinct extent items. That means updating metadata pages in snapshots, and can lead to 4-digit multiples of write amplification with only a few dozen snapshots--in the worst cases there are page splits because the old data now needs space for 3x more reference items. So in v1 we don't do anything like that--extents are immutable from the moment they are created until their last reference is deleted. In v2, file B doesn't refer to file A's extent. Instead, file B creates a new extent which overlaps the physical space of file A's extent. After overwriting the one new page, file B then replaces its reference to file A's space with two new references to shared parts of file A's space, and a third new extent item for the new data in B. If file A is later deleted, the lack of reference to the middle of the physical space is (eventually) detected, and the overwritten part of the shared extent becomes free space. ^ permalink raw reply [flat|nested] 71+ messages in thread
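The v1 behaviour described above can be observed from userspace with ordinary tools; a rough sketch (hypothetical file names and sizes, not taken from the thread):

dd if=/dev/urandom of=A bs=1M count=8
sync
cp --reflink=always A B
dd if=/dev/zero of=B bs=4k count=1 seek=1024 conv=notrunc   # overwrite one 4K block of B
sync
compsize A B
# Expect roughly: Referenced ~16M across both files, but Disk Usage only ~8M plus
# one new 4K extent -- B still refers to most of A's original extent, and that
# extent stays whole on disk even though B no longer uses one block of it.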
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-14 22:59 ` Zygo Blaxell 2022-03-15 18:28 ` Phillip Susi @ 2022-03-20 17:50 ` Forza 2022-03-20 21:15 ` Zygo Blaxell 1 sibling, 1 reply; 71+ messages in thread From: Forza @ 2022-03-20 17:50 UTC (permalink / raw) To: Zygo Blaxell, Phillip Susi; +Cc: Qu Wenruo, Jan Ziak, linux-btrfs On 2022-03-14 23:59, Zygo Blaxell wrote: > > On Mon, Mar 14, 2022 at 04:09:08PM -0400, Phillip Susi wrote: >> >> Qu Wenruo <quwenruo.btrfs@gmx.com> writes: >> >>> That's more or less expected. >>> >>> Autodefrag has two limitations: >>> >>> 1. Only defrag newer writes >>> It doesn't defrag older fragments. >>> This is the existing behavior from the beginning of autodefrag. >>> Thus it's not that effective against small random writes. >> >> I don't understand this bit. The whole point of defrag is to reduce the >> fragmentation of previous writes. New writes should always attempt to >> follow the previous one if possible. > This is my assumption as well. I believe that VM images was one of the original use cases (though I have no reference to that at the moment). The btrfs wiki[1] says the following for autodefrag: "Though autodefrag affects newly written data, it can read a few adjacent blocks (up to 64k) and write the contiguous extent to a new location. The adjacent blocks will be unshared." The Btrfs administration manual[2] says the following: "When enabled, small random writes into files (in a range of tens of kilobytes, currently it’s 64KiB) are detected and queued up for the defragmentation process. Not well suited for large database workloads." > New writes are allocated to the first available free space hole large > enough to hold them, starting from the point of the last write (plus > some other details like clustering and alignment). The goal is that > data writes from memory are sequential as much as possible, even if > many different files were written in the same transaction. > > btrfs extents are immutable, so the filesystem can't extend an existing > extent with new data. Instead, a new extent must be created that contains > both the old and new data to replace the old extent. At least one new > fragment must be created whenever the filesystem is modified. (In > zoned mode, this is strictly enforced by the underlying hardware.) > >> If auto defrag only changes the >> behavior of new writes, then how does it change it and why is that not >> the way new writes are always done? > > Autodefrag doesn't change write behavior directly. It is a > post-processing thread that rereads and rewrites recently written data, > _after_ it was originally written to disk. > > In theory, running defrag after the writes means that the writes can > be fast for low latency--they are a physically sequential stream of > blocks sent to the disk as fast as it can write them, because btrfs does > not have to be concerned with trying to achieve physical contiguity > of logically discontiguous data. Later on, when latency is no longer an > issue and some IO bandwidth is available, the fragments can be reread > and collected together into larger logically and physically contiguous > extents by a background process. > > In practice, autodefrag does only part of that task, badly. > > Say we have a program that writes 4K to the end of a file, every 5 > seconds, for 5 minutes. 
> > Every 30 seconds (default commit interval), kernel writeback submits all > the dirty pages for writing to btrfs, and in 30 seconds there will be 6 > x 4K = 24K of those. An extent in btrfs is created to hold the pages, > filled with the data blocks, connected to the various filesystem trees, > and flushed out to disk. > > Over 5 minutes this will happen 10 times, so the file contains 10 > fragments, each about 24K (commits are asynchronous, so it might be > 20K in one fragment and 28K in the next). > > After each commit, inodes with new extents are appended to a list > in memory. Each list entry contains an inode, a transid of the commit > where the first write occurred, and the last defrag offset. That list > is processed by a kernel thread some time after the commits are written > to disk. The thread searches the inodes for extents created after the > last defrag transid, invokes defrag_range on each of these, and advances > the offset. If the search offset reaches the end of file, then it is > reset to the beginning and another loop is done, and if the next search > loop over the file doesn't find new extents then the inode is removed > from the defrag list. > > If there's a 5 minute delay between the original writes and autodefrag > finally catching up, then autodefrag will detect 10 new extents and > run defrag_range over them. This is a read-then-write operation, since > the extent blocks may no longer be present in memory after writeback, > so autodefrag can easily fall behind writes if there are a lot of them. > Also the 64K size limit kicks in, so it might write 5 extents (2 x 24K = > 48K, but 3 x 24K = 72K, and autodefrag cuts off at 64K). > > If there's a 1 minute delay between the original writes and autodefrag, > then autodefrag will detect 1 new extents and run defrag over them > for a total of 5 new extents, about 240K each. If there's no delay > at all, then there will be 10 extents of 120K each--if autodefrag > runs immediately after commit, it will see only one extent in each > loop, and issue no defrag_range calls. > > Seen from the point of view of the disk, there are always at least > 10x 120K writes. In the no-autodefrag case it ends there. In the > autodefrag cases, some of the data is read and rewritten later to make > larger extents. > > In non-appending cases, the kernel autodefrag doesn't do very much useful > at all--random writes aren't logically contiguous, so autodefrag never > sees two adjacent extents in a search result, and therefore never sees > an opportunity to defrag anything. I have a worst-case scenario with Netdata. It stores historical data in ndf files that are up to 1GiB in size. In addition there is a journal file of about 100-200MiB. The extents are extremely small and sequential read speeds are around 1-2MiB/s (this is a HDD), which makes fetching historical data _extremely_ slow. Kernel 5.16.12, 5.16.16 btrfs-progs v5.16.2 Files: Size Date Name 1073741824 Mar 15 03:02 datafile-1-0000000113.ndf 1024217088 Mar 20 18:10 datafile-1-0000000114.ndf 140648448 Mar 15 03:02 journalfile-1-0000000113.njf 137732096 Mar 20 18:05 journalfile-1-0000000114.njf # compsize datafile-1-0000000113.ndf ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Processed 1 file, 59407 regular extents (59407 refs), 0 inline. 
Type Perc Disk Usage Uncompressed Referenced TOTAL 100% 1.0G 1.0G 1.0G none 100% 1.0G 1.0G 1.0G ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The average size of the extents is here 17KiB # compsize journalfile-1-0000000113.njf ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Processed 1 file, 34338 regular extents (34338 refs), 0 inline. Type Perc Disk Usage Uncompressed Referenced TOTAL 100% 134M 134M 134M none 100% 134M 134M 134M ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The average extent of the journal file is 4KiB. I have "mount -o autodefrag" enabled but it has no effect on these files. I have also tried enabling compression with "btrfs propterty set compression zstd" but it did not reduce the file size or change the amount of extents much. As a last resort I tried running Netdata behind "eatmydata", but it also didn't help. It seems that this case is exactly as Zygo describes, that small amounts of random writes do not get considered for defragment. It takes about 5 days to fill one of these ndf datafiles (about 8-9MiB per hour). # filefrag -v datafile-1-0000000114.ndf ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Filesystem type is: 9123683e File size of datafile-1-0000000114.ndf is 1028616192 (251127 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 0: 15863417202..15863417202: 1: 1: 1.. 1: 15863417597..15863417597: 1: 15863417203: 2: 2.. 2: 15874142482..15874142482: 1: 15863417598: 3: 3.. 8: 16093579003..16093579008: 6: 15874142483: 4: 9.. 13: 16017881714..16017881718: 5: 16093579009: 5: 14.. 19: 16095939276..16095939281: 6: 16017881719: 6: 20.. 27: 16110397810..16110397817: 8: 16095939282: 7: 28.. 29: 15874165302..15874165303: 2: 16110397818: 8: 30.. 30: 15874160314..15874160314: 1: 15874165304: 9: 31.. 31: 15874164808..15874164808: 1: 15874160315: 10: 32.. 39: 16110399763..16110399770: 8: 15874164809: 11: 40.. 43: 16017882226..16017882229: 4: 16110399771: 12: 44.. 47: 16017882292..16017882295: 4: 16017882230: 13: 48.. 53: 16097265263..16097265268: 6: 16017882296: 14: 54.. 55: 15877195212..15877195213: 2: 16097265269: 15: 56.. 60: 16018077866..16018077870: 5: 15877195214: 16: 61.. 64: 16017882755..16017882758: 4: 16018077871: 17: 65.. 68: 16017882623..16017882626: 4: 16017882759: 18: 69.. 69: 15877196587..15877196587: 1: 16017882627: 19: 70.. 70: 15877198419..15877198419: 1: 15877196588: 20: 71.. 82: 16110463493..16110463504: 12: 15877198420: 21: 83.. 83: 15878073533..15878073533: 1: 16110463505: 22: 84.. 84: 15878073875..15878073875: 1: 15878073534: 23: 85.. 85: 15878074124..15878074124: 1: 15878073876: 24: 86.. 86: 15878074958..15878074958: 1: 15878074125: 25: 87.. 87: 15878268816..15878268816: 1: 15878074959: 26: 88.. 88: 15878297633..15878297633: 1: 15878268817: 27: 89.. 89: 15878144045..15878144045: 1: 15878297634: 28: 90.. 90: 15878144854..15878144854: 1: 15878144046: 29: 91.. 91: 15880621654..15880621654: 1: 15878144855: 30: 92.. 92: 15884311220..15884311220: 1: 15880621655: 31: 93.. 93: 15884314722..15884314722: 1: 15884311221: 32: 94.. 94: 15884314726..15884314726: 1: 15884314723: 33: 95.. 95: 15877198895..15877198895: 1: 15884314727: 34: 96.. 96: 15877199305..15877199305: 1: 15877198896: 35: 97.. 98: 15877199312..15877199313: 2: 15877199306: 36: 99.. 101: 15878346833..15878346835: 3: 15877199314: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > At the time autodefrag was added to the kernel (May 2011), it was already > possible do to a better job in userspace for over a year (Feb 2010). 
> Between 2012 and 2021 there are only a handful of bug fixes, mostly of > the form "stop autodefrag from ruining things for the rest of the kernel." Doesn't userspace defragmentation have the disadvantage that it has to process the whole file with all its extents? But if it could be used to defrag only the last few modified extents, could it help in situations like this? Certainly a userspace defragment daemon could be used to implement custom policies suitable for specific workloads. Thanks Forza [1] https://btrfs.wiki.kernel.org/index.php/Status [2] https://btrfs.readthedocs.io/en/latest/Administration.html?highlight=autodefrag#mount-options ^ permalink raw reply [flat|nested] 71+ messages in thread
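One deliberately simple form such a userspace policy could take, sketched with plain btrfs-progs (the file name comes from the example above; the 16MiB window and 32M target size are arbitrary, and a ranged defrag also unshares any reflinked or snapshotted data it touches):

f=datafile-1-0000000114.ndf
size=$(stat -c %s "$f")
win=$((16 * 1024 * 1024))                    # only touch the most recently written tail
start=$(( size > win ? size - win : 0 ))
btrfs filesystem defragment -f -s "$start" -l "$win" -t 32M "$f"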
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-20 17:50 ` Forza @ 2022-03-20 21:15 ` Zygo Blaxell 0 siblings, 0 replies; 71+ messages in thread From: Zygo Blaxell @ 2022-03-20 21:15 UTC (permalink / raw) To: Forza; +Cc: Phillip Susi, Qu Wenruo, Jan Ziak, linux-btrfs On Sun, Mar 20, 2022 at 06:50:55PM +0100, Forza wrote: > > > > On 2022-03-14 23:59, Zygo Blaxell wrote: > > > > On Mon, Mar 14, 2022 at 04:09:08PM -0400, Phillip Susi wrote: > > > > > > Qu Wenruo <quwenruo.btrfs@gmx.com> writes: > > > > > > > That's more or less expected. > > > > > > > > Autodefrag has two limitations: > > > > > > > > 1. Only defrag newer writes > > > > It doesn't defrag older fragments. > > > > This is the existing behavior from the beginning of autodefrag. > > > > Thus it's not that effective against small random writes. > > > > > > I don't understand this bit. The whole point of defrag is to reduce the > > > fragmentation of previous writes. New writes should always attempt to > > > follow the previous one if possible. > > > > This is my assumption as well. I believe that VM images was one of the > original use cases (though I have no reference to that at the moment). > > The btrfs wiki[1] says the following for autodefrag: > > "Though autodefrag affects newly written data, it can read a few adjacent > blocks (up to 64k) and write the contiguous extent to a new location. The > adjacent blocks will be unshared." > > The Btrfs administration manual[2] says the following: > > "When enabled, small random writes into files (in a range of tens of > kilobytes, currently it’s 64KiB) are detected and queued up for the > defragmentation process. Not well suited for large database workloads." These statements are not technically incorrect, but they are an understatement of the situation. What's missing is that autodefrag's very specific behavior isn't useful for typical user workloads prone to fragmentation. > > New writes are allocated to the first available free space hole large > > enough to hold them, starting from the point of the last write (plus > > some other details like clustering and alignment). The goal is that > > data writes from memory are sequential as much as possible, even if > > many different files were written in the same transaction. > > > > btrfs extents are immutable, so the filesystem can't extend an existing > > extent with new data. Instead, a new extent must be created that contains > > both the old and new data to replace the old extent. At least one new > > fragment must be created whenever the filesystem is modified. (In > > zoned mode, this is strictly enforced by the underlying hardware.) > > > > > If auto defrag only changes the > > > behavior of new writes, then how does it change it and why is that not > > > the way new writes are always done? > > > > Autodefrag doesn't change write behavior directly. It is a > > post-processing thread that rereads and rewrites recently written data, > > _after_ it was originally written to disk. > > > > In theory, running defrag after the writes means that the writes can > > be fast for low latency--they are a physically sequential stream of > > blocks sent to the disk as fast as it can write them, because btrfs does > > not have to be concerned with trying to achieve physical contiguity > > of logically discontiguous data. 
Later on, when latency is no longer an > > issue and some IO bandwidth is available, the fragments can be reread > > and collected together into larger logically and physically contiguous > > extents by a background process. > > > > In practice, autodefrag does only part of that task, badly. > > > > Say we have a program that writes 4K to the end of a file, every 5 > > seconds, for 5 minutes. > > > > Every 30 seconds (default commit interval), kernel writeback submits all > > the dirty pages for writing to btrfs, and in 30 seconds there will be 6 > > x 4K = 24K of those. An extent in btrfs is created to hold the pages, > > filled with the data blocks, connected to the various filesystem trees, > > and flushed out to disk. > > > > Over 5 minutes this will happen 10 times, so the file contains 10 > > fragments, each about 24K (commits are asynchronous, so it might be > > 20K in one fragment and 28K in the next). > > > > After each commit, inodes with new extents are appended to a list > > in memory. Each list entry contains an inode, a transid of the commit > > where the first write occurred, and the last defrag offset. That list > > is processed by a kernel thread some time after the commits are written > > to disk. The thread searches the inodes for extents created after the > > last defrag transid, invokes defrag_range on each of these, and advances > > the offset. If the search offset reaches the end of file, then it is > > reset to the beginning and another loop is done, and if the next search > > loop over the file doesn't find new extents then the inode is removed > > from the defrag list. > > > > If there's a 5 minute delay between the original writes and autodefrag > > finally catching up, then autodefrag will detect 10 new extents and > > run defrag_range over them. This is a read-then-write operation, since > > the extent blocks may no longer be present in memory after writeback, > > so autodefrag can easily fall behind writes if there are a lot of them. > > Also the 64K size limit kicks in, so it might write 5 extents (2 x 24K = > > 48K, but 3 x 24K = 72K, and autodefrag cuts off at 64K). > > > > If there's a 1 minute delay between the original writes and autodefrag, > > then autodefrag will detect 1 new extents and run defrag over them > > for a total of 5 new extents, about 240K each. If there's no delay > > at all, then there will be 10 extents of 120K each--if autodefrag > > runs immediately after commit, it will see only one extent in each > > loop, and issue no defrag_range calls. > > > > Seen from the point of view of the disk, there are always at least > > 10x 120K writes. In the no-autodefrag case it ends there. In the > > autodefrag cases, some of the data is read and rewritten later to make > > larger extents. > > > > In non-appending cases, the kernel autodefrag doesn't do very much useful > > at all--random writes aren't logically contiguous, so autodefrag never > > sees two adjacent extents in a search result, and therefore never sees > > an opportunity to defrag anything. > > I have a worst-case scenario with Netdata. It stores historical data in ndf > files that are up to 1GiB in size. In addition there is a journal file of > about 100-200MiB. The extents are extremely small and sequential read speeds > are around 1-2MiB/s (this is a HDD), which makes fetching historical data > _extremely_ slow. 
> > Kernel 5.16.12, 5.16.16 > btrfs-progs v5.16.2 > > Files: > Size Date Name > 1073741824 Mar 15 03:02 datafile-1-0000000113.ndf > 1024217088 Mar 20 18:10 datafile-1-0000000114.ndf > 140648448 Mar 15 03:02 journalfile-1-0000000113.njf > 137732096 Mar 20 18:05 journalfile-1-0000000114.njf > > > # compsize datafile-1-0000000113.ndf > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > Processed 1 file, 59407 regular extents (59407 refs), 0 inline. > Type Perc Disk Usage Uncompressed Referenced > TOTAL 100% 1.0G 1.0G 1.0G > none 100% 1.0G 1.0G 1.0G > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > The average size of the extents is here 17KiB > > > # compsize journalfile-1-0000000113.njf > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > Processed 1 file, 34338 regular extents (34338 refs), 0 inline. > Type Perc Disk Usage Uncompressed Referenced > TOTAL 100% 134M 134M 134M > none 100% 134M 134M 134M > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > The average extent of the journal file is 4KiB. > > I have "mount -o autodefrag" enabled but it has no effect on these files. I > have also tried enabling compression with "btrfs propterty set compression > zstd" but it did not reduce the file size or change the amount of extents > much. > > As a last resort I tried running Netdata behind "eatmydata", but it also > didn't help. > > It seems that this case is exactly as Zygo describes, that small amounts of > random writes do not get considered for defragment. It takes about 5 days to > fill one of these ndf datafiles (about 8-9MiB per hour). > > # filefrag -v datafile-1-0000000114.ndf > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > Filesystem type is: 9123683e > File size of datafile-1-0000000114.ndf is 1028616192 (251127 blocks of 4096 > bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 0: 15863417202..15863417202: 1: > 1: 1.. 1: 15863417597..15863417597: 1: 15863417203: > 2: 2.. 2: 15874142482..15874142482: 1: 15863417598: > 3: 3.. 8: 16093579003..16093579008: 6: 15874142483: > 4: 9.. 13: 16017881714..16017881718: 5: 16093579009: > 5: 14.. 19: 16095939276..16095939281: 6: 16017881719: > 6: 20.. 27: 16110397810..16110397817: 8: 16095939282: > 7: 28.. 29: 15874165302..15874165303: 2: 16110397818: > 8: 30.. 30: 15874160314..15874160314: 1: 15874165304: > 9: 31.. 31: 15874164808..15874164808: 1: 15874160315: > 10: 32.. 39: 16110399763..16110399770: 8: 15874164809: > 11: 40.. 43: 16017882226..16017882229: 4: 16110399771: > 12: 44.. 47: 16017882292..16017882295: 4: 16017882230: > 13: 48.. 53: 16097265263..16097265268: 6: 16017882296: > 14: 54.. 55: 15877195212..15877195213: 2: 16097265269: > 15: 56.. 60: 16018077866..16018077870: 5: 15877195214: > 16: 61.. 64: 16017882755..16017882758: 4: 16018077871: > 17: 65.. 68: 16017882623..16017882626: 4: 16017882759: > 18: 69.. 69: 15877196587..15877196587: 1: 16017882627: > 19: 70.. 70: 15877198419..15877198419: 1: 15877196588: > 20: 71.. 82: 16110463493..16110463504: 12: 15877198420: > 21: 83.. 83: 15878073533..15878073533: 1: 16110463505: > 22: 84.. 84: 15878073875..15878073875: 1: 15878073534: > 23: 85.. 85: 15878074124..15878074124: 1: 15878073876: > 24: 86.. 86: 15878074958..15878074958: 1: 15878074125: > 25: 87.. 87: 15878268816..15878268816: 1: 15878074959: > 26: 88.. 88: 15878297633..15878297633: 1: 15878268817: > 27: 89.. 89: 15878144045..15878144045: 1: 15878297634: > 28: 90.. 90: 15878144854..15878144854: 1: 15878144046: > 29: 91.. 91: 15880621654..15880621654: 1: 15878144855: > 30: 92.. 92: 15884311220..15884311220: 1: 15880621655: > 31: 93.. 
93: 15884314722..15884314722: 1: 15884311221: > 32: 94.. 94: 15884314726..15884314726: 1: 15884314723: > 33: 95.. 95: 15877198895..15877198895: 1: 15884314727: > 34: 96.. 96: 15877199305..15877199305: 1: 15877198896: > 35: 97.. 98: 15877199312..15877199313: 2: 15877199306: > 36: 99.. 101: 15878346833..15878346835: 3: 15877199314: > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > At the time autodefrag was added to the kernel (May 2011), it was already > > possible do to a better job in userspace for over a year (Feb 2010). > > Between 2012 and 2021 there are only a handful of bug fixes, mostly of > > the form "stop autodefrag from ruining things for the rest of the kernel." > > Doesn't userspace defragment has the disadvantage that is has to process the > whole file with all it's extents? No. There's a DEFRAG_RANGE ioctl which will defragment a specific range of a specific file, with some restrictions. The restrictions can be worked around by making a temporary copy and deduping it over the original data. It's possible to rearrange extents more or less arbitrarily from userspace, mostly invisible to applications that might be reading or writing them at the same time. I forget exactly what the restrictions are. We would need defrag to skip the write permission check for superuser, and skip atime/mtime/ctime updates, the same way the dedupe ioctl does. We'd also need to remove any minimum size limits or second-guessing in DEFRAG_RANGE so that it does precisely what userspace tells it to do. Currently there's some hardcoded assumptions built into DEFRAG_RANGE because 'btrfs fi defrag' isn't smart enough to issue good DEFRAG_RANGE commands. Unfortunately both defrag and dedupe have a problem where they will only operate on one extent reference at a time, so a solution that supports snapshots will involve a mix of both ioctls (defrag to construct a defragmented extent, dedupe to install that extent in each affected snapshot, plus some logic to decide whether it's worthwhile to do that or leave the old extents alone). That can be done _after_ getting basic autodefrag working for the easier single-subvol cases. > But if it could be used to defrag only the > last few modified extents could help in situations like this? "Last few modified extents" is the broken thing the kernel does. Autodefrag should start with those (they can be found very quickly using tree_search), but the defrag region must expand to include some of the older adjacent extents too. > Certainly a > userspace defragment daemon could be used to implement custom policies > suitable for specific workloads. > > Thanks > Forza > > > [1] https://btrfs.wiki.kernel.org/index.php/Status > [2] https://btrfs.readthedocs.io/en/latest/Administration.html?highlight=autodefrag#mount-options > ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-07 2:39 ` Qu Wenruo 2022-03-07 7:31 ` Qu Wenruo @ 2022-03-08 21:57 ` Jan Ziak 2022-03-08 23:40 ` Qu Wenruo 2022-03-09 4:48 ` Zygo Blaxell 1 sibling, 2 replies; 71+ messages in thread From: Jan Ziak @ 2022-03-08 21:57 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On Mon, Mar 7, 2022 at 3:39 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > On 2022/3/7 10:23, Jan Ziak wrote: > > BTW: "compsize file-with-million-extents" finishes in 0.2 seconds > > (uses BTRFS_IOC_TREE_SEARCH_V2 ioctl), but "filefrag > > file-with-million-extents" doesn't finish even after several minutes > > of time (uses FS_IOC_FIEMAP ioctl - manages to perform only about 5 > > ioctl syscalls per second - and appears to be slowing down as the > > value of the "fm_start" ioctl argument grows; e2fsprogs version > > 1.46.5). It would be nice if filefrag was faster than just a few > > ioctls per second. > > This is mostly a race with autodefrag. > > Both are using file extent map, thus if autodefrag is still trying to > redirty the file again and again, it would definitely cause problems for > anything also using file extent map. It isn't caused by a race with autodefrag, but by something else. Autodefrag was turned off when I was running "filefrag file-with-million-extents". $ /usr/bin/time filefrag file-with-million-extents.sqlite Ctrl+C Command terminated by signal 2 0.000000 user, 4.327145 system, 0:04.331167 elapsed, 99% CPU Sincerely Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-08 21:57 ` Jan Ziak @ 2022-03-08 23:40 ` Qu Wenruo 2022-03-09 22:22 ` Jan Ziak 0 siblings, 1 reply; 71+ messages in thread From: Qu Wenruo @ 2022-03-08 23:40 UTC (permalink / raw) To: Jan Ziak; +Cc: linux-btrfs On 2022/3/9 05:57, Jan Ziak wrote: > On Mon, Mar 7, 2022 at 3:39 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >> On 2022/3/7 10:23, Jan Ziak wrote: >>> BTW: "compsize file-with-million-extents" finishes in 0.2 seconds >>> (uses BTRFS_IOC_TREE_SEARCH_V2 ioctl), but "filefrag >>> file-with-million-extents" doesn't finish even after several minutes >>> of time (uses FS_IOC_FIEMAP ioctl - manages to perform only about 5 >>> ioctl syscalls per second - and appears to be slowing down as the >>> value of the "fm_start" ioctl argument grows; e2fsprogs version >>> 1.46.5). It would be nice if filefrag was faster than just a few >>> ioctls per second. >> >> This is mostly a race with autodefrag. >> >> Both are using file extent map, thus if autodefrag is still trying to >> redirty the file again and again, it would definitely cause problems for >> anything also using file extent map. > > It isn't caused by a race with autodefrag, but by something else. > Autodefrag was turned off when I was running "filefrag > file-with-million-extents". > > $ /usr/bin/time filefrag file-with-million-extents.sqlite > Ctrl+C Command terminated by signal 2 > 0.000000 user, 4.327145 system, 0:04.331167 elapsed, 99% CPU Too many file extents will slow down the full fiemap call. If you use ranged fiemap, like: xfs_io -c "fiemap -v 0 4k" <file> It should finish very quick. BTW, I have sent out a new autodefrag patch and CCed you. Mind testing that patch? (Better with trace events enabled) Thanks, Qu > > Sincerely > Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
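If a map of the whole file is still wanted, the ranged form can be applied in windows instead of one huge call; a sketch (the file name is from the thread, the 1GiB window size is arbitrary, and each window still has to resolve extent sharing internally, so the total cost remains significant):

f=file-with-million-extents.sqlite
size=$(stat -c %s "$f")
off=0
while [ "$off" -lt "$size" ]; do
    xfs_io -rc "fiemap -v $off 1g" "$f"    # one bounded FIEMAP window at a time
    off=$((off + 1024 * 1024 * 1024))
done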
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-08 23:40 ` Qu Wenruo @ 2022-03-09 22:22 ` Jan Ziak 2022-03-09 22:44 ` Qu Wenruo 0 siblings, 1 reply; 71+ messages in thread From: Jan Ziak @ 2022-03-09 22:22 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On Wed, Mar 9, 2022 at 12:40 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > On 2022/3/9 05:57, Jan Ziak wrote: > > On Mon, Mar 7, 2022 at 3:39 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > >> On 2022/3/7 10:23, Jan Ziak wrote: > >>> BTW: "compsize file-with-million-extents" finishes in 0.2 seconds > >>> (uses BTRFS_IOC_TREE_SEARCH_V2 ioctl), but "filefrag > >>> file-with-million-extents" doesn't finish even after several minutes > >>> of time (uses FS_IOC_FIEMAP ioctl - manages to perform only about 5 > >>> ioctl syscalls per second - and appears to be slowing down as the > >>> value of the "fm_start" ioctl argument grows; e2fsprogs version > >>> 1.46.5). It would be nice if filefrag was faster than just a few > >>> ioctls per second. > >> > >> This is mostly a race with autodefrag. > >> > >> Both are using file extent map, thus if autodefrag is still trying to > >> redirty the file again and again, it would definitely cause problems for > >> anything also using file extent map. > > > > It isn't caused by a race with autodefrag, but by something else. > > Autodefrag was turned off when I was running "filefrag > > file-with-million-extents". > > > > $ /usr/bin/time filefrag file-with-million-extents.sqlite > > Ctrl+C Command terminated by signal 2 > > 0.000000 user, 4.327145 system, 0:04.331167 elapsed, 99% CPU > > Too many file extents will slow down the full fiemap call. > > If you use ranged fiemap, like: > > xfs_io -c "fiemap -v 0 4k" <file> > > It should finish very quick. Unfortunately, that doesn't seem to be the case (Linux 5.16.12). xfs_io -c "fiemap -v 0 4g" completes and prints .... 16935: [8387168..8388791]: 22237781600..22237783223 1624 0x0 in 0.6 seconds. But xfs_io -c "fiemap -v 0 40g" is significantly slower, does not complete in a reasonable time, and makes it to 1000 .... 1000: [154576..154903]: 22232564688..22232565015 328 0x0 .... in 6.5 seconds. The NVMe device was mostly idle when running the above commands (reads and writes per second were close to zero). In summary: xfs_io -c "fiemap -v 0 4g" is approximately 185 times faster than xfs_io -c "fiemap -v 0 40g". -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-09 22:22 ` Jan Ziak @ 2022-03-09 22:44 ` Qu Wenruo 2022-03-09 22:55 ` Jan Ziak 0 siblings, 1 reply; 71+ messages in thread From: Qu Wenruo @ 2022-03-09 22:44 UTC (permalink / raw) To: Jan Ziak; +Cc: linux-btrfs On 2022/3/10 06:22, Jan Ziak wrote: > On Wed, Mar 9, 2022 at 12:40 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >> On 2022/3/9 05:57, Jan Ziak wrote: >>> On Mon, Mar 7, 2022 at 3:39 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >>>> On 2022/3/7 10:23, Jan Ziak wrote: >>>>> BTW: "compsize file-with-million-extents" finishes in 0.2 seconds >>>>> (uses BTRFS_IOC_TREE_SEARCH_V2 ioctl), but "filefrag >>>>> file-with-million-extents" doesn't finish even after several minutes >>>>> of time (uses FS_IOC_FIEMAP ioctl - manages to perform only about 5 >>>>> ioctl syscalls per second - and appears to be slowing down as the >>>>> value of the "fm_start" ioctl argument grows; e2fsprogs version >>>>> 1.46.5). It would be nice if filefrag was faster than just a few >>>>> ioctls per second. >>>> >>>> This is mostly a race with autodefrag. >>>> >>>> Both are using file extent map, thus if autodefrag is still trying to >>>> redirty the file again and again, it would definitely cause problems for >>>> anything also using file extent map. >>> >>> It isn't caused by a race with autodefrag, but by something else. >>> Autodefrag was turned off when I was running "filefrag >>> file-with-million-extents". >>> >>> $ /usr/bin/time filefrag file-with-million-extents.sqlite >>> Ctrl+C Command terminated by signal 2 >>> 0.000000 user, 4.327145 system, 0:04.331167 elapsed, 99% CPU >> >> Too many file extents will slow down the full fiemap call. >> >> If you use ranged fiemap, like: >> >> xfs_io -c "fiemap -v 0 4k" <file> >> >> It should finish very quick. > > Unfortunately, that doesn't seem to be the case (Linux 5.16.12). > > xfs_io -c "fiemap -v 0 4g" completes and prints Well, I literally mean 4k, which is ensured to be one extent. Thanks, Qu > > .... > 16935: [8387168..8388791]: 22237781600..22237783223 1624 0x0 > > in 0.6 seconds. > > But xfs_io -c "fiemap -v 0 40g" is significantly slower, does not > complete in a reasonable time, and makes it to 1000 > > .... > 1000: [154576..154903]: 22232564688..22232565015 328 0x0 > .... > > in 6.5 seconds. > > The NVMe device was mostly idle when running the above commands (reads > and writes per second were close to zero). > > In summary: xfs_io -c "fiemap -v 0 4g" is approximately 185 times > faster than xfs_io -c "fiemap -v 0 40g". > > -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-09 22:44 ` Qu Wenruo @ 2022-03-09 22:55 ` Jan Ziak 2022-03-09 23:00 ` Jan Ziak 0 siblings, 1 reply; 71+ messages in thread From: Jan Ziak @ 2022-03-09 22:55 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On Wed, Mar 9, 2022 at 11:44 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > xfs_io -c "fiemap -v 0 40g" > > Well, I literally mean 4k, which is ensured to be one extent. The usefulness of such information would be 4k/40g = 1e-6 = 0.0001%. -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-09 22:55 ` Jan Ziak @ 2022-03-09 23:00 ` Jan Ziak 0 siblings, 0 replies; 71+ messages in thread From: Jan Ziak @ 2022-03-09 23:00 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On Wed, Mar 9, 2022 at 11:55 PM Jan Ziak <0xe2.0x9a.0x9b@gmail.com> wrote: > > On Wed, Mar 9, 2022 at 11:44 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > > xfs_io -c "fiemap -v 0 40g" > > > > Well, I literally mean 4k, which is ensured to be one extent. > > The usefulness of such information would be 4k/40g = 1e-6 = 0.0001%. 1e-7 or 0.00001%, of course. Sorry about the confusion. -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-08 21:57 ` Jan Ziak 2022-03-08 23:40 ` Qu Wenruo @ 2022-03-09 4:48 ` Zygo Blaxell 1 sibling, 0 replies; 71+ messages in thread From: Zygo Blaxell @ 2022-03-09 4:48 UTC (permalink / raw) To: Jan Ziak; +Cc: Qu Wenruo, linux-btrfs On Tue, Mar 08, 2022 at 10:57:51PM +0100, Jan Ziak wrote: > On Mon, Mar 7, 2022 at 3:39 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > On 2022/3/7 10:23, Jan Ziak wrote: > > > BTW: "compsize file-with-million-extents" finishes in 0.2 seconds > > > (uses BTRFS_IOC_TREE_SEARCH_V2 ioctl), but "filefrag > > > file-with-million-extents" doesn't finish even after several minutes > > > of time (uses FS_IOC_FIEMAP ioctl - manages to perform only about 5 > > > ioctl syscalls per second - and appears to be slowing down as the > > > value of the "fm_start" ioctl argument grows; e2fsprogs version > > > 1.46.5). It would be nice if filefrag was faster than just a few > > > ioctls per second. > > > > This is mostly a race with autodefrag. > > > > Both are using file extent map, thus if autodefrag is still trying to > > redirty the file again and again, it would definitely cause problems for > > anything also using file extent map. > > It isn't caused by a race with autodefrag, but by something else. > Autodefrag was turned off when I was running "filefrag > file-with-million-extents". > > $ /usr/bin/time filefrag file-with-million-extents.sqlite > Ctrl+C Command terminated by signal 2 > 0.000000 user, 4.327145 system, 0:04.331167 elapsed, 99% CPU FIEMAP will try to populate the SHARED bit for each extent, which requires checking every extent that overlaps a block range to see if the block is present. It can be very expensive for large, random-written files. No way to fix that without disabling the SHARED bit in FIEMAP. > Sincerely > Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-06 15:59 Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit Jan Ziak 2022-03-07 0:48 ` Qu Wenruo @ 2022-03-07 14:30 ` Phillip Susi 2022-03-08 21:43 ` Jan Ziak 2022-03-16 12:47 ` Kai Krakow 2 siblings, 1 reply; 71+ messages in thread From: Phillip Susi @ 2022-03-07 14:30 UTC (permalink / raw) To: Jan Ziak; +Cc: linux-btrfs Jan Ziak <0xe2.0x9a.0x9b@gmail.com> writes: > Manual defragmentation decreased the file's size by 7 GB: Eh? How does defragging change a file's size? ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-07 14:30 ` Phillip Susi @ 2022-03-08 21:43 ` Jan Ziak 2022-03-09 18:46 ` Phillip Susi 0 siblings, 1 reply; 71+ messages in thread From: Jan Ziak @ 2022-03-08 21:43 UTC (permalink / raw) To: Phillip Susi; +Cc: linux-btrfs On Mon, Mar 7, 2022 at 3:31 PM Phillip Susi <phill@thesusis.net> wrote: > Jan Ziak <0xe2.0x9a.0x9b@gmail.com> writes: > > Manual defragmentation decreased the file's size by 7 GB: > Eh? How does defragging change a file's size? I noticed this inaccurate wording in my email as well, but that was (unfortunately) after I already sent the email. I was hoping that after examining the compsize logs present in the email, readers would understand what the term "file size" means in this particular case. Sincerely Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-08 21:43 ` Jan Ziak @ 2022-03-09 18:46 ` Phillip Susi 2022-03-09 21:35 ` Jan Ziak 0 siblings, 1 reply; 71+ messages in thread From: Phillip Susi @ 2022-03-09 18:46 UTC (permalink / raw) To: Jan Ziak; +Cc: linux-btrfs Jan Ziak <0xe2.0x9a.0x9b@gmail.com> writes: > I noticed this inaccurate wording in my email as well, but that was > (unfortunately) after I already sent the email. I was hoping that > after examining the compsize logs present in the email, readers would > understand what the term "file size" means in this particular case. I don't understand. You stated that the size decreased by 7 GB, and the size figures that followed appeared to bear that out. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-09 18:46 ` Phillip Susi @ 2022-03-09 21:35 ` Jan Ziak 2022-03-14 20:02 ` Phillip Susi 0 siblings, 1 reply; 71+ messages in thread From: Jan Ziak @ 2022-03-09 21:35 UTC (permalink / raw) To: Phillip Susi; +Cc: linux-btrfs On Wed, Mar 9, 2022 at 7:47 PM Phillip Susi <phill@thesusis.net> wrote: > Jan Ziak <0xe2.0x9a.0x9b@gmail.com> writes: > > > I noticed this inaccurate wording in my email as well, but that was > > (unfortunately) after I already sent the email. I was hoping that > > after examining the compsize logs present in the email, readers would > > understand what the term "file size" means in this particular case. > > I don't understand. You stated that the size decreased by 7 GB, and the > size figures that followed appeared to bear that out. The actual disk usage of a file in a copy-on-write filesystem can be much larger than sb.st_size obtained via fstat(fd, &sb) if, for example, a program performs many (millions) single-byte file changes using write(fd, buf, 1) to distinct/random offsets in the file. Before running "btrfs fi de file.sqlite": the disk usage of the file was 47GB, 1834778 extents After running "btrfs fi de file.sqlite; sync": the disk usage of the file was 40GB, 13074 extents Sincerely Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-09 21:35 ` Jan Ziak @ 2022-03-14 20:02 ` Phillip Susi 2022-03-14 21:53 ` Jan Ziak 2022-03-16 16:52 ` Andrei Borzenkov 0 siblings, 2 replies; 71+ messages in thread From: Phillip Susi @ 2022-03-14 20:02 UTC (permalink / raw) To: Jan Ziak; +Cc: linux-btrfs Jan Ziak <0xe2.0x9a.0x9b@gmail.com> writes: > The actual disk usage of a file in a copy-on-write filesystem can be > much larger than sb.st_size obtained via fstat(fd, &sb) if, for > example, a program performs many (millions) single-byte file changes > using write(fd, buf, 1) to distinct/random offsets in the file. How? I mean if you write to part of the file a new block is written somewhere else but the original one is then freed, so the overall size should not change. Just because all of the blocks of the file are not contiguous does not mean that the file has more of them, and making them contiguous does not reduce the number of them. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-14 20:02 ` Phillip Susi @ 2022-03-14 21:53 ` Jan Ziak 2022-03-14 22:24 ` Remi Gauvin 2022-03-15 18:15 ` Phillip Susi 1 sibling, 2 replies; 71+ messages in thread From: Jan Ziak @ 2022-03-14 21:53 UTC (permalink / raw) Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 2479 bytes --] On Mon, Mar 14, 2022 at 9:05 PM Phillip Susi <phill@thesusis.net> wrote: > Jan Ziak <0xe2.0x9a.0x9b@gmail.com> writes: > > > The actual disk usage of a file in a copy-on-write filesystem can be > > much larger than sb.st_size obtained via fstat(fd, &sb) if, for > > example, a program performs many (millions) single-byte file changes > > using write(fd, buf, 1) to distinct/random offsets in the file. > > How? I mean if you write to part of the file a new block is written > somewhere else but the original one is then freed, so the overall size > should not change. Just because all of the blocks of the file are not > contiguous does not mean that the file has more of them, and making them > contiguous does not reduce the number of them. > It is true that it is possible to design a copy-on-write filesystem, albeit with additional costs, that will never waste a single extent even in the case of highly fragmented files. But btrfs isn't such a filesystem. The manpage of /usr/bin/compsize contains the following diagram (use a fixed font when viewing):

           +-------+-------+---------------+
extent A   | used  | waste |     used      |
           +-------+-------+---------------+
extent B           | used  |
                   +-------+

However, what the manpage doesn't mention is that, in the case of btrfs, the above diagram applies not only to compressed extents but to other types of extents as well. You can examine this yourself if you compile compsize-1.5 using "make debug" on your machine and use the Bash script that is attached to this email. The Bash script creates one 10 MiB file. This file has 1 extent of size 10 MiB (assuming the btrfs filesystem has enough non-fragmented free space to create a contiguous extent of size 10 MiB). Then the script writes random 4K blocks at random 4K offsets in the file. Examination of compsize's debug output shows that the whole 10 MiB extent is still stored on the storage device, despite the fact that many of the 4K pages comprising the 10 MiB extent have been overwritten and the file has been synced to the storage device: .... regular: ram_bytes=10485760 compression=0 disk_num_bytes=10485760 .... In this test case, "Disk Usage" is 60% higher than the file's size:

$ compsize data
Processed 1 file, 612 regular extents (1221 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%      16M          16M          10M

-Jan [-- Attachment #2: btrfs-waste.sh --] [-- Type: application/x-shellscript, Size: 557 bytes --] ^ permalink raw reply [flat|nested] 71+ messages in thread
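A rough reconstruction of the kind of test just described (the actual attached btrfs-waste.sh is not reproduced here; the file name, sizes and iteration count are made up for illustration):

dd if=/dev/urandom of=data bs=1M count=10 status=none    # ~10 MiB file, ideally one large extent
sync
for i in $(seq 1 2000); do
    off=$((RANDOM % 2560))                               # random 4 KiB block within the 10 MiB file
    dd if=/dev/urandom of=data bs=4k count=1 seek="$off" conv=notrunc status=none
done
sync
compsize data    # Disk Usage can end up well above Referenced: the original extent stays
                 # fully allocated as long as any block of it is still referenced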
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-14 21:53 ` Jan Ziak @ 2022-03-14 22:24 ` Remi Gauvin 2022-03-14 22:51 ` Zygo Blaxell 0 siblings, 1 reply; 71+ messages in thread From: Remi Gauvin @ 2022-03-14 22:24 UTC (permalink / raw) To: linux-btrfs On 2022-03-14 5:53 p.m., Jan Ziak wrote: > .... > > In this test case, "Disk Usage" is 60% higher than the file's size: > > $ compsize data > Processed 1 file, 612 regular extents (1221 refs), 0 inline. > Type Perc Disk Usage Uncompressed Referenced > TOTAL 100% 16M 16M 10M It would be nice if we could get a mount option to specify maximum extent size, so this effect could be minimized on SSD without having to use compress-force. (Or maybe this should be the default when ssd mode is automatically detected.) ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-14 22:24 ` Remi Gauvin @ 2022-03-14 22:51 ` Zygo Blaxell 2022-03-14 23:07 ` Remi Gauvin 0 siblings, 1 reply; 71+ messages in thread From: Zygo Blaxell @ 2022-03-14 22:51 UTC (permalink / raw) To: Remi Gauvin; +Cc: linux-btrfs On Mon, Mar 14, 2022 at 06:24:43PM -0400, Remi Gauvin wrote: > On 2022-03-14 5:53 p.m., Jan Ziak wrote: > > > > .... > > > > In this test case, "Disk Usage" is 60% higher than the file's size: > > > > $ compsize data > > Processed 1 file, 612 regular extents (1221 refs), 0 inline. > > Type Perc Disk Usage Uncompressed Referenced > > TOTAL 100% 16M 16M 10M > > > It would be nice if we could get a mount option to specify maximum > extent size, so this effect could be minimized on SSD without having to > use compress-force. (Or maybe this should be the default when ssd mode > is automatically detected.) If you never use prealloc or defrag, it's usually not a problem. Files mostly fall into two categories: big sequential writes (where big extents are better) or small random writes (where big extents are bad, but you don't have any of those because you're doing small random writes all the time). Writeback gets this right most of the time, so the extents end up the right sizes on disk. If all your writes are random, short, and aligned to a multiple of 4K, then you'll end up in a steady state with a lot of short extents and little to no wasted space. If you run defrag on that, you end up with half the space wasted, and if the writes continue, lots of small extents either way. Prealloc's bad effects are similar to defrag, but with more reliable losses. A mount option to disable prealloc globally might be very useful--I run a number of apps that think prealloc doesn't waste huge amounts of CPU time and disk space on datacow files, and I grow weary of patching or LD_PRELOAD-hacking them all the time to not call fallocate(). ^ permalink raw reply [flat|nested] 71+ messages in thread
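A quick way to check whether a given application preallocates at all, before reaching for LD_PRELOAD hacks, is to trace its fallocate calls; preallocated ranges also show up as "unwritten" extents in filefrag output (application and file names below are placeholders):

$ strace -f -e trace=fallocate -o /tmp/fallocate.log someapp
$ filefrag -v somefile | grep -c unwritten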
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-14 22:51 ` Zygo Blaxell @ 2022-03-14 23:07 ` Remi Gauvin 2022-03-14 23:39 ` Zygo Blaxell 0 siblings, 1 reply; 71+ messages in thread From: Remi Gauvin @ 2022-03-14 23:07 UTC (permalink / raw) To: linux-btrfs On 2022-03-14 6:51 p.m., Zygo Blaxell wrote: > > If you never use prealloc or defrag, it's usually not a problem. > You're assuming that the file is created from scratch on that media. VM and databases that are restored from images/backups, or re-written as some kind of maintenance (shrink vm images, compress database, or whatever) become a huge problem. In one instance, I had a VM image that was taking up more than 100% of its filesize due to lack of defrag. For a while I was regularly defragmenting those with a target size of 100MB as the only way to garbage collect, but that is a shameful waste of write cycles on SSD. Adding compress-force=lzo was the only way for me to solve this issue (and it even seems to help performance (on SSD, *not* HDD), though probably not for small random reads; I haven't properly compared that.) ^ permalink raw reply [flat|nested] 71+ messages in thread
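The workaround described above corresponds roughly to the following commands (paths are placeholders; note that defragmenting rewrites data and can unshare reflinked or snapshotted copies):

# One-off: rewrite the image with a 100M target extent size to release space
# pinned by partially overwritten extents.
$ btrfs filesystem defragment -t 100M /srv/vm/image.raw
# Ongoing: force lzo compression so new extents stay small (at most 128 KiB).
$ mount -o remount,compress-force=lzo /srv/vm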
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-14 23:07 ` Remi Gauvin @ 2022-03-14 23:39 ` Zygo Blaxell 2022-03-15 14:14 ` Remi Gauvin 0 siblings, 1 reply; 71+ messages in thread From: Zygo Blaxell @ 2022-03-14 23:39 UTC (permalink / raw) To: Remi Gauvin; +Cc: linux-btrfs On Mon, Mar 14, 2022 at 07:07:44PM -0400, Remi Gauvin wrote: > On 2022-03-14 6:51 p.m., Zygo Blaxell wrote: > > If you never use prealloc or defrag, it's usually not a problem. > > You're assuming that the file is created from scratch on that media. VM > and databases that are restored from images/backups, or re-written as > some kind of maintenance (shrink vm images, compress database, or > whatever) become a huge problem. VM images do sometimes combine sequential and random writes and create a lot of waste. They are one of the outliers that is a problem case even with a normal life cycle (as opposed to one interrupted by backup restore). A cluster command in SQL can instantly double a DB size. > In one instance, I had a VM image that was taking up more than 100% of > its filesize due to lack of defrag. For a while I was regularly > defragmenting those with a target size of 100MB as the only way to garbage > collect, but that is a shameful waste of write cycles on SSD. Adding > compress-force=lzo was the only way for me to solve this issue (and it > even seems to help performance (on SSD, *not* HDD), though probably not > for small random reads; I haven't properly compared that.) Ideally we'd have a proper garbage collection tool for btrfs that ran defrag _only_ on extents that are holding references to wasted space, which is the side-effect of defrag that most people want instead of what defrag nominally tries to do. I have it on my already too-long to-do list. If we're adding a mount option for this (I'm not opposed to it, I'm pointing out that it's not the first tool to reach for), then ideally we'd overload it for the compressed batch size (currently hardcoded at 512K). I have IO patterns that would like compress-force to write 128M uncompressed extents, and provide enough extents at a time to keep all the cores busy sequentially compressing a single extent on each one. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-14 23:39 ` Zygo Blaxell @ 2022-03-15 14:14 ` Remi Gauvin 2022-03-15 18:51 ` Zygo Blaxell 0 siblings, 1 reply; 71+ messages in thread From: Remi Gauvin @ 2022-03-15 14:14 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs On 2022-03-14 7:39 p.m., Zygo Blaxell wrote: > > If we're adding a mount option for this (I'm not opposed to it, I'm > pointing out that it's not the first tool to reach for), then ideally > we'd overload it for the compressed batch size (currently hardcoded > at 512K). Are there any advantages to extents larger than 256K on ssd Media? Even if a much needed garbage collection process were to be created, the smaller extents would mean less data would need to be re-written, (and potentially duplicated due to snapshots and ref copies.) The fine details on how to implement all of this is way over my head, but it seemed to me like the logic to keep the extents small is already more or less already there, and would need relatively very little work to manifest. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-15 14:14 ` Remi Gauvin @ 2022-03-15 18:51 ` Zygo Blaxell 2022-03-15 19:22 ` Remi Gauvin 0 siblings, 1 reply; 71+ messages in thread From: Zygo Blaxell @ 2022-03-15 18:51 UTC (permalink / raw) To: Remi Gauvin; +Cc: linux-btrfs On Tue, Mar 15, 2022 at 10:14:01AM -0400, Remi Gauvin wrote: > On 2022-03-14 7:39 p.m., Zygo Blaxell wrote: > > If we're adding a mount option for this (I'm not opposed to it, I'm > > pointing out that it's not the first tool to reach for), then ideally > > we'd overload it for the compressed batch size (currently hardcoded > > at 512K). > > Are there any advantages to extents larger than 256K on ssd Media? The main advantage of larger extents is smaller metadata, and it doesn't matter very much whether it's SSD or HDD. Adjacent extents will be in the same metadata page, so not much is lost with 256K extents even on HDD, as long as they are physically allocated adjacent to each other. There is a CPU hit for every extent, and when snapshot pages become unshared, every distinct extent on the page needs its reference count updated for the new page. The costs of small extents add up during balances, resizes, and snapshot deletes, but on a small filesystem you'd want smaller extents so that balances and resizes are possible at all (this is why there's a 128M limit now--previously, extents of multiple GB were possible). Averaged across my filesystems, half of the data blocks are in extents below 512K, and only 1% of extents are 1M or larger. Capping the extent size at 256K wouldn't make much difference--the total extent count would increase by less than 5%. In my defrag experiments, the pareto limit kicks in at a target extent size of 100K-200K (anything larger than this doesn't get better when defragged, anything smaller kills performance if it's _not_ defragged). 256K may already be larger than optimal for some workloads. > Even if a much needed garbage collection process were to be created, > the smaller extents would mean less data would need to be re-written, > (and potentially duplicated due to snapshots and ref copies.) GC has to take all references into account when computing block reachability, and it has to eliminate all references to remove garbage, so there should not be any new duplicate data. Currently GC has to be implemented by copying the data and then using dedupe to replace references to the original data individually, but that could be optimized with a new kernel ioctl that handles all the references at once with a lock, instead of comparing the data bytes for each one. GC could create smaller extents intentionally, by creating new extents in units of 256K, but reflinking them in reverse order over the original large extents to prevent coalescing extents in writeback. GC would also have to figure out whether the IO cost of splitting the extent is worth the space saving (e.g. don't relocate 100MB of data to save 4K of disk space, wait until it's at least 1MB of space saved). That's a sysadmin policy input. GC is not autodefrag. If it sees that it has to carve up 100M extents for sub-64K writes, GC can create 400x 256K extents to replace the large extents, and only defrag when there's a contiguous range of modified extents with length 64K or less. Or whatever sizes turn out to be the right ones--setting the sizes isn't the hard thing to do here. 
Obviously, in that scenario it is more efficient if there's a way to not write the 100M extents in the first place, but it quickly reaches a steady state with relatively little wasted space, and doesn't require tuning knobs in the kernel. GC + autodefrag could go the other way, too: make the default extent size small, but allow autodefrag to request very large extents for files that have not been modified in a while. That's inefficient too, but in other other direction, so it would be a better match for the steady state of some workloads (e.g. video recording or log files). Ideally there'd be an "optimum extent size" inheritable inode property, so we can have databases use tiny extents and video recorders use huge extents on the same filesystem. But maybe that's overengineering, and 256K (128K? 512K?) is within the range of values for most. > The fine details on how to implement all of this is way over my head, > but it seemed to me like the logic to keep the extents small is already > more or less already there, and would need relatively very little work > to manifest. There's a #define for maximum new extent length. It wouldn't be too difficult to look up that number in fs_info instead, slightly harder to look it up in an inode. The limit applies only to new extents, so there's no backward compatibility issue with the on-disk format. ^ permalink raw reply [flat|nested] 71+ messages in thread
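For reference, a rough way to look at the extent-size distribution of a single file, assuming filefrag -v's usual column layout (extent length in the sixth column, counted in 4 KiB filesystem blocks):

$ filefrag -v file.sqlite \
    | awk '/^ *[0-9]+:/ { gsub(":", "", $6); print $6 * 4 }' \
    | sort -n | uniq -c | tail
# prints "count  extent-size-in-KiB" pairs, largest sizes last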
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-15 18:51 ` Zygo Blaxell @ 2022-03-15 19:22 ` Remi Gauvin 2022-03-15 21:08 ` Zygo Blaxell 0 siblings, 1 reply; 71+ messages in thread From: Remi Gauvin @ 2022-03-15 19:22 UTC (permalink / raw) To: Zygo Blaxell, linux-btrfs On 2022-03-15 2:51 p.m., Zygo Blaxell wrote: > The main advantage of larger extents is smaller metadata, and it doesn't > matter very much whether it's SSD or HDD. Adjacent extents will be in > the same metadata page, so not much is lost with 256K extents even on HDD, > as long as they are physically allocated adjacent to each other. > When I tried enabling compress-force on my HDD storage, it *killed* sequential read performance. I could write a file out at over 100MB/s... but trying to read that same file sequentially would thrash the drives, with less than 5MB/s actually being read. No such problems were observed on SSD storage. I was under the impression this problem was caused by trying to read files with the 127k extents, which, for whatever reason, could not be done without excessive seeking. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-15 19:22 ` Remi Gauvin @ 2022-03-15 21:08 ` Zygo Blaxell 0 siblings, 0 replies; 71+ messages in thread From: Zygo Blaxell @ 2022-03-15 21:08 UTC (permalink / raw) To: Remi Gauvin; +Cc: linux-btrfs On Tue, Mar 15, 2022 at 03:22:43PM -0400, Remi Gauvin wrote: > On 2022-03-15 2:51 p.m., Zygo Blaxell wrote: > > > The main advantage of larger extents is smaller metadata, and it doesn't > > matter very much whether it's SSD or HDD. Adjacent extents will be in > > the same metadata page, so not much is lost with 256K extents even on HDD, > > as long as they are physically allocated adjacent to each other. > > > > When I tried enabling compress-force on my HDD storage, it *killed* > sequential read performance. I could write a file out at over > 100MB/s... but trying to read that same file sequentially would thrash > the drives, with less than 5MB/s actually being read. > > > No such problems were observed on SSD storage. I've seen a similar effect. I wonder if the small extents are breaking readahead or something. > I was under the impression this problem was caused by trying to read files > with the 127k extents, which, for whatever reason, could not be done > without excessive seeking. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-14 21:53 ` Jan Ziak 2022-03-14 22:24 ` Remi Gauvin @ 2022-03-15 18:15 ` Phillip Susi 1 sibling, 0 replies; 71+ messages in thread From: Phillip Susi @ 2022-03-15 18:15 UTC (permalink / raw) To: Jan Ziak; +Cc: unlisted-recipients, linux-btrfs Jan Ziak <0xe2.0x9a.0x9b@gmail.com> writes: > However, what the manpage doesn't mention is that, in the case of > btrfs, the above diagram applies not only to compressed extents but to > other types of extents as well. Ok, so if you are using compression then your choices are either to read the entire 128k compressed block, decompress it, update the 4k, recompress it, and write the whole thing back... or just write the modified 4k elsewhere and now there is some wasted space in the compressed block. But why would something like this happen without compression? ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-14 20:02 ` Phillip Susi 2022-03-14 21:53 ` Jan Ziak @ 2022-03-16 16:52 ` Andrei Borzenkov 2022-03-16 18:28 ` Jan Ziak 2022-03-16 18:31 ` Phillip Susi 1 sibling, 2 replies; 71+ messages in thread From: Andrei Borzenkov @ 2022-03-16 16:52 UTC (permalink / raw) To: Phillip Susi, Jan Ziak; +Cc: linux-btrfs On 14.03.2022 23:02, Phillip Susi wrote: > > Jan Ziak <0xe2.0x9a.0x9b@gmail.com> writes: > >> The actual disk usage of a file in a copy-on-write filesystem can be >> much larger than sb.st_size obtained via fstat(fd, &sb) if, for >> example, a program performs many (millions) single-byte file changes >> using write(fd, buf, 1) to distinct/random offsets in the file. > > How? I mean if you write to part of the file a new block is written > somewhere else but the original one is then freed, so the overall size > should not change. Just because all of the blocks of the file are not > contiguous does not mean that the file has more of them, and making them > contiguous does not reduce the number of them. > btrfs does not manage space in fixed size blocks. You describe behavior of WAFL. btrfs manages space in variable size extents. If you change 999 bytes in 1000 bytes extent, original extent remains allocated because 1 byte is still referenced. So actual space consumption is now 1999 bytes. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-16 16:52 ` Andrei Borzenkov @ 2022-03-16 18:28 ` Jan Ziak 2022-03-16 18:31 ` Phillip Susi 1 sibling, 0 replies; 71+ messages in thread From: Jan Ziak @ 2022-03-16 18:28 UTC (permalink / raw) To: Andrei Borzenkov; +Cc: Phillip Susi, linux-btrfs On Wed, Mar 16, 2022 at 5:52 PM Andrei Borzenkov <arvidjaar@gmail.com> wrote: > On 14.03.2022 23:02, Phillip Susi wrote: > > > > Jan Ziak <0xe2.0x9a.0x9b@gmail.com> writes: > > > >> The actual disk usage of a file in a copy-on-write filesystem can be > >> much larger than sb.st_size obtained via fstat(fd, &sb) if, for > >> example, a program performs many (millions) single-byte file changes > >> using write(fd, buf, 1) to distinct/random offsets in the file. > > > > How? I mean if you write to part of the file a new block is written > > somewhere else but the original one is then freed, so the overall size > > should not change. Just because all of the blocks of the file are not > > contiguous does not mean that the file has more of them, and making them > > contiguous does not reduce the number of them. > > > > btrfs does not manage space in fixed size blocks. You describe behavior > of WAFL. The "single-byte file changes using write(fd, buf, 1)" was just an example for the purpose of the discussion. The example isn't related to the 40GB sqlite file. -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-16 16:52 ` Andrei Borzenkov 2022-03-16 18:28 ` Jan Ziak @ 2022-03-16 18:31 ` Phillip Susi 2022-03-16 18:43 ` Andrei Borzenkov ` (2 more replies) 1 sibling, 3 replies; 71+ messages in thread From: Phillip Susi @ 2022-03-16 18:31 UTC (permalink / raw) To: Andrei Borzenkov; +Cc: Jan Ziak, linux-btrfs Andrei Borzenkov <arvidjaar@gmail.com> writes: > btrfs manages space in variable size extents. If you change 999 bytes in > 1000 bytes extent, original extent remains allocated because 1 byte is > still referenced. So actual space consumption is now 1999 bytes. Huh? You can't really do that because the page cache manages files in 4k pages. If you have a 1M extent and you touch one byte in the file, thus making one 4k page dirty, are you really saying that btrfs will write that modified 4k page somewhere else and NOT free the 4k block that is no longer used? Why the heck not? ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-16 18:31 ` Phillip Susi @ 2022-03-16 18:43 ` Andrei Borzenkov 2022-03-16 18:46 ` Jan Ziak 2022-03-16 19:04 ` Zygo Blaxell 2 siblings, 0 replies; 71+ messages in thread From: Andrei Borzenkov @ 2022-03-16 18:43 UTC (permalink / raw) To: Phillip Susi; +Cc: Jan Ziak, linux-btrfs On 16.03.2022 21:31, Phillip Susi wrote: > > Andrei Borzenkov <arvidjaar@gmail.com> writes: > >> btrfs manages space in variable size extents. If you change 999 bytes in >> 1000 bytes extent, original extent remains allocated because 1 byte is >> still referenced. So actual space consumption is now 1999 bytes. > > Huh? You can't really do that because the page cache manages files in > 4k pages. If you have a 1M extent and you touch one byte in the file, > thus making one 4k page dirty, are you really saying that btrfs will > write that modified 4k page somewhere else and NOT free the 4k block > that is no longer used? yes. > Why the heck not? > Short answer - because it is implemented this way. There could be an arbitrary number of overlapping references to this extent. Tracking all of these references on every write, to decide whether an extent can be split, would likely be prohibitively inefficient. The alternative is to use fixed-size blocks, where freeing space is just a matter of reference counting. This means increasing the size of the metadata needed to keep track of blocks, with unknown impact. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-16 18:31 ` Phillip Susi 2022-03-16 18:43 ` Andrei Borzenkov @ 2022-03-16 18:46 ` Jan Ziak 2022-03-16 19:04 ` Zygo Blaxell 2 siblings, 0 replies; 71+ messages in thread From: Jan Ziak @ 2022-03-16 18:46 UTC (permalink / raw) To: Phillip Susi; +Cc: Andrei Borzenkov, linux-btrfs On Wed, Mar 16, 2022 at 7:35 PM Phillip Susi <phill@thesusis.net> wrote: > Andrei Borzenkov <arvidjaar@gmail.com> writes: > > > btrfs manages space in variable size extents. If you change 999 bytes in > > 1000 bytes extent, original extent remains allocated because 1 byte is > > still referenced. So actual space consumption is now 1999 bytes. > > Huh? You can't really do that because the page cache manages files in > 4k pages. If you have a 1M extent and you touch one byte in the file, > thus making one 4k page dirty, are you really saying that btrfs will > write that modified 4k page somewhere else and NOT free the 4k block > that is no longer used? The questions "Why ... will it write that modified 4k page somewhere else?" and "Why ... not free the 4k block that is no longer used?" are two separate questions. In any CoW filesystem, the answer to the 1st question is: because it is a CoW filesystem. Because it is a basic assumption/premise of the filesystem's design. The answer to the 2nd question depends on whether the CoW filesystem is well-optimized to handle such a scenario or not optimized to handle such a scenario. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-16 18:31 ` Phillip Susi 2022-03-16 18:43 ` Andrei Borzenkov 2022-03-16 18:46 ` Jan Ziak @ 2022-03-16 19:04 ` Zygo Blaxell 2022-03-17 20:34 ` Phillip Susi 2 siblings, 1 reply; 71+ messages in thread From: Zygo Blaxell @ 2022-03-16 19:04 UTC (permalink / raw) To: Phillip Susi; +Cc: Andrei Borzenkov, Jan Ziak, linux-btrfs On Wed, Mar 16, 2022 at 02:31:34PM -0400, Phillip Susi wrote: > > Andrei Borzenkov <arvidjaar@gmail.com> writes: > > > btrfs manages space in variable size extents. If you change 999 bytes in > > 1000 bytes extent, original extent remains allocated because 1 byte is > > still referenced. So actual space consumption is now 1999 bytes. > > Huh? You can't really do that because the page cache manages files in > 4k pages. You can get a 1-byte file reference if you make a reflink of the last block of a 4097-byte file, or punch a hole in the first 4096 bytes of a 4097-byte file. This creates a file containing only a reference to the last byte of the original extent. In theory you could create a 4098-byte file, then make reflinks from that file to two other files covering the last 1 and last 2 bytes; however, that's disallowed in the kernel to make sure that assorted dedupe data leak shenanigans with shared reflinks that don't all end at the same byte in the page can't ever happen. ^ permalink raw reply [flat|nested] 71+ messages in thread
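A sketch of the hole-punch variant described above (file name is illustrative; exact compsize numbers will vary):

# Create a 4097-byte file, then punch out the first 4096 bytes. The file now
# references only the final byte, but the original extent stays allocated.
$ head -c 4097 /dev/urandom > tailbyte
$ sync
$ fallocate --punch-hole --offset 0 --length 4096 tailbyte
$ sync
$ compsize tailbyte    # referenced bytes drop, disk usage of the extent does not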
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-16 19:04 ` Zygo Blaxell @ 2022-03-17 20:34 ` Phillip Susi 2022-03-17 22:06 ` Zygo Blaxell 0 siblings, 1 reply; 71+ messages in thread From: Phillip Susi @ 2022-03-17 20:34 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Andrei Borzenkov, Jan Ziak, linux-btrfs Zygo Blaxell <ce3g8jdj@umail.furryterror.org> writes: > You can get a 1-byte file reference if you make a reflink of the last > block of a 4097-byte file, or punch a hole in the first 4096 bytes of a > 4097-byte file. This creates a file containing only a reference to the > last byte of the original extent. So the inode only refers to one byte of the extent, but the extent is still always a multiple of 4k right? ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-17 20:34 ` Phillip Susi @ 2022-03-17 22:06 ` Zygo Blaxell 0 siblings, 0 replies; 71+ messages in thread From: Zygo Blaxell @ 2022-03-17 22:06 UTC (permalink / raw) To: Phillip Susi; +Cc: Andrei Borzenkov, Jan Ziak, linux-btrfs On Thu, Mar 17, 2022 at 04:34:51PM -0400, Phillip Susi wrote: > Zygo Blaxell <ce3g8jdj@umail.furryterror.org> writes: > > You can get a 1-byte file reference if you make a reflink of the last > > block of a 4097-byte file, or punch a hole in the first 4096 bytes of a > > 4097-byte file. This creates a file containing only a reference to the > > last byte of the original extent. > > So the inode only refers to one byte of the extent, but the extent is > still always a multiple of 4k right? Yes. In theory, the on-disk format specifies extent locations and sizes in bytes. In practice, the kernel enforces that all the extent physical boundaries be a multiple of the CPU page size (or multiples of _some_ CPU's page size, with the subpage patches). On read, anything with a logical length that isn't a multiple of 4K is zero-filled. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-06 15:59 Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit Jan Ziak 2022-03-07 0:48 ` Qu Wenruo 2022-03-07 14:30 ` Phillip Susi @ 2022-03-16 12:47 ` Kai Krakow 2022-03-16 18:18 ` Jan Ziak 2 siblings, 1 reply; 71+ messages in thread From: Kai Krakow @ 2022-03-16 12:47 UTC (permalink / raw) To: Jan Ziak; +Cc: linux-btrfs Hello! Am So., 6. März 2022 um 18:57 Uhr schrieb Jan Ziak <0xe2.0x9a.0x9b@gmail.com>: > > I would like to report that btrfs in Linux kernel 5.16.12 mounted with > the autodefrag option wrote 5TB in a single day to a 1TB SSD that is > about 50% full. > > Defragmenting 0.5TB on a drive that is 50% full should write far less than 5TB. > > Benefits to the fragmentation of the most written files over the > course of the one day (sqlite database files) are nil. Please see the > data below. Also note that the sqlite file is using up to 10 GB more > than it should due to fragmentation. > > CPU utilization on an otherwise idle machine is approximately 600% all > the time: btrfs-cleaner 100%, kworkers...btrfs 500%. > > I am not just asking you to fix this issue - I am asking you how is it > possible for an algorithm that is significantly worse than O(N*log(N)) > to be merged into the Linux kernel in the first place!? > > Please try to avoid discussing no-CoW (chattr +C) in your response, > because it is beside the point. Thanks. Yeah, that's one solution. But you could also try disabling double-write by turning on WAL-mode in sqlite: Use the cmdline tool to connect to the database file, then run "PRAGMA journal_mode=WAL;". This can only be switched when only one client is connected, so you need to temporarily suspend processes using the database. (https://dev.to/lefebvre/speed-up-sqlite-with-write-ahead-logging-wal-do) It may be worth disabling compression: "chattr +m DIRECTORY-OF-SQLITE-DB", but this can only be switched for newly created files, so you'd need to rename and copy the existing sqlite files. This reduces the number of extents created. Enabling WAL disables the rollback journal and prevents smallish in-place updates of data blocks in the database file. Instead, it uses checkpointing to update the database safely in bigger chunks, thus using write patterns better suited to CoW filesystems. HTH Kai ^ permalink raw reply [flat|nested] 71+ messages in thread
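For reference, switching an existing database to WAL mode from the shell might look like this (path is a placeholder; the setting is persistent, but it can only be changed while no other connection has the database open):

$ sqlite3 /path/to/file.sqlite 'PRAGMA journal_mode=WAL;'   # prints "wal" on success
$ sqlite3 /path/to/file.sqlite 'PRAGMA journal_mode;'       # verify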
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit 2022-03-16 12:47 ` Kai Krakow @ 2022-03-16 18:18 ` Jan Ziak 0 siblings, 0 replies; 71+ messages in thread From: Jan Ziak @ 2022-03-16 18:18 UTC (permalink / raw) To: Kai Krakow; +Cc: linux-btrfs On Wed, Mar 16, 2022 at 1:48 PM Kai Krakow <hurikhan77+btrfs@gmail.com> wrote: > Am So., 6. März 2022 um 18:57 Uhr schrieb Jan Ziak <0xe2.0x9a.0x9b@gmail.com>: > > Please try to avoid discussing no-CoW (chattr +C) in your response, > > because it is beside the point. Thanks. > > Yeah, that's one solution. But you could also try disabling > double-write by turning on WAL-mode in sqlite: As far as I can tell, the app is using journal_mode=wal for all database connections using code such as: c = await aiosqlite.connect(db_path) await c.execute("pragma journal_mode=wal") There are sqlite-wal files in all database directories. Compression is disabled. According to Bash history, I executed "btrfs filesystem defragment -r" in the 41 GB sqlite directory about 2 days ago. The current number of extents (after 2 days) is: $ compsize file.sqlite Processed 1 file, 1438640 regular extents (2593549 refs), 0 inline. Type Perc Disk Usage Uncompressed Referenced TOTAL 100% 50G 50G 41G none 100% 50G 50G 41G -Jan ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit @ 2022-06-17 0:20 Jan Ziak 0 siblings, 0 replies; 71+ messages in thread From: Jan Ziak @ 2022-06-17 0:20 UTC (permalink / raw) To: linux-btrfs This is an update to the previously reported btrfs fragmentation issues. Defragmenting a 79 GiB file increased the number of bytes allocated to the file in a btrfs filesystem from 118 GiB to 161 GiB: linux 5.17.5 btrfs-progs 5.18.1 $ compsize file.sqlite Type Perc Disk Usage Uncompressed Referenced TOTAL 99% 117G 118G 78G none 100% 116G 116G 77G zstd 30% 471M 1.5G 1.2G $ btrfs filesystem defragment -t 256K file.sqlite $ compsize file.sqlite Type Perc Disk Usage Uncompressed Referenced TOTAL 99% 160G 161G 78G none 100% 159G 159G 77G zstd 28% 405M 1.3G 1.3G $ dd if=file.sqlite of=file-1.sqlite bs=1M status=progress 84659167232 bytes (85 GB, 79 GiB) copied, 122.376 s, 692 MB/s $ compsize file-1.sqlite Type Perc Disk Usage Uncompressed Referenced TOTAL 98% 77G 78G 78G none 100% 77G 77G 77G zstd 28% 361M 1.2G 1.2G ^ permalink raw reply [flat|nested] 71+ messages in thread
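Pending a real garbage-collection tool, the copy shown above can be turned into a manual compaction step (paths are illustrative; the application must be stopped, and any sqlite -wal file checkpointed, before the files are swapped):

$ cp --reflink=never file.sqlite file.sqlite.compact
$ sync
$ compsize file.sqlite.compact    # disk usage should now match the referenced size
$ mv file.sqlite.compact file.sqlite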
end of thread, other threads:[~2022-06-17 0:21 UTC | newest] Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-03-06 15:59 Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit Jan Ziak 2022-03-07 0:48 ` Qu Wenruo 2022-03-07 2:23 ` Jan Ziak 2022-03-07 2:39 ` Qu Wenruo 2022-03-07 7:31 ` Qu Wenruo 2022-03-10 1:10 ` Jan Ziak 2022-03-10 1:26 ` Qu Wenruo 2022-03-10 4:33 ` Jan Ziak 2022-03-10 6:42 ` Qu Wenruo 2022-03-10 21:31 ` Jan Ziak 2022-03-10 23:27 ` Qu Wenruo 2022-03-11 2:42 ` Jan Ziak 2022-03-11 2:59 ` Qu Wenruo 2022-03-11 5:04 ` Jan Ziak 2022-03-11 16:31 ` Jan Ziak 2022-03-11 20:02 ` Jan Ziak 2022-03-11 23:04 ` Qu Wenruo 2022-03-11 23:28 ` Jan Ziak 2022-03-11 23:39 ` Qu Wenruo 2022-03-12 0:01 ` Jan Ziak 2022-03-12 0:15 ` Qu Wenruo 2022-03-12 3:16 ` Zygo Blaxell 2022-03-12 2:43 ` Zygo Blaxell 2022-03-12 3:24 ` Qu Wenruo 2022-03-12 3:48 ` Zygo Blaxell 2022-03-14 20:09 ` Phillip Susi 2022-03-14 22:59 ` Zygo Blaxell 2022-03-15 18:28 ` Phillip Susi 2022-03-15 19:28 ` Jan Ziak 2022-03-15 21:06 ` Zygo Blaxell 2022-03-15 22:20 ` Jan Ziak 2022-03-16 17:02 ` Zygo Blaxell 2022-03-16 17:48 ` Jan Ziak 2022-03-17 2:11 ` Zygo Blaxell 2022-03-16 18:46 ` Phillip Susi 2022-03-16 19:59 ` Zygo Blaxell 2022-03-20 17:50 ` Forza 2022-03-20 21:15 ` Zygo Blaxell 2022-03-08 21:57 ` Jan Ziak 2022-03-08 23:40 ` Qu Wenruo 2022-03-09 22:22 ` Jan Ziak 2022-03-09 22:44 ` Qu Wenruo 2022-03-09 22:55 ` Jan Ziak 2022-03-09 23:00 ` Jan Ziak 2022-03-09 4:48 ` Zygo Blaxell 2022-03-07 14:30 ` Phillip Susi 2022-03-08 21:43 ` Jan Ziak 2022-03-09 18:46 ` Phillip Susi 2022-03-09 21:35 ` Jan Ziak 2022-03-14 20:02 ` Phillip Susi 2022-03-14 21:53 ` Jan Ziak 2022-03-14 22:24 ` Remi Gauvin 2022-03-14 22:51 ` Zygo Blaxell 2022-03-14 23:07 ` Remi Gauvin 2022-03-14 23:39 ` Zygo Blaxell 2022-03-15 14:14 ` Remi Gauvin 2022-03-15 18:51 ` Zygo Blaxell 2022-03-15 19:22 ` Remi Gauvin 2022-03-15 21:08 ` Zygo Blaxell 2022-03-15 18:15 ` Phillip Susi 2022-03-16 16:52 ` Andrei Borzenkov 2022-03-16 18:28 ` Jan Ziak 2022-03-16 18:31 ` Phillip Susi 2022-03-16 18:43 ` Andrei Borzenkov 2022-03-16 18:46 ` Jan Ziak 2022-03-16 19:04 ` Zygo Blaxell 2022-03-17 20:34 ` Phillip Susi 2022-03-17 22:06 ` Zygo Blaxell 2022-03-16 12:47 ` Kai Krakow 2022-03-16 18:18 ` Jan Ziak 2022-06-17 0:20 Jan Ziak