* [POC for v5.15 0/2] btrfs: defrag: what if v5.15 is doing proper defrag
@ 2022-01-25 6:50 Qu Wenruo
  2022-01-25 6:50 ` [POC for v5.15 1/2] btrfs: defrag: don't defrag preallocated extents Qu Wenruo
                  ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Qu Wenruo @ 2022-01-25 6:50 UTC (permalink / raw)
To: linux-btrfs

** DON'T MERGE, THIS IS JUST A PROOF OF CONCEPT **

There are several reports that btrfs autodefrag in v5.16 is causing more
IO than in v5.15.

But it turns out that commit f458a3873ae ("btrfs: fix race when
defragmenting leads to unnecessary IO") made defrag do less work than it
should, thus damping the IO for autodefrag.

This POC series makes the v5.15 kernel do a proper defrag of all good
candidates while still not defragging any hole/preallocated range.

The test script looks like this:

	wipefs -fa $dev
	mkfs.btrfs -f $dev -U $uuid > /dev/null
	mount $dev $mnt -o autodefrag
	$fsstress -w -n 2000 -p 1 -d $mnt -s 1642319517
	sync
	echo "=== baseline ==="
	cat /sys/fs/btrfs/$uuid/debug/io_accounting/data_write
	echo 0 > /sys/fs/btrfs/$uuid/debug/cleaner_trigger
	sleep 3
	sync
	echo "=== after autodefrag ==="
	cat /sys/fs/btrfs/$uuid/debug/io_accounting/data_write
	umount $mnt

<uuid>/debug/io_accounting/data_write is a new debug feature showing how
many bytes have been written for a btrfs filesystem.
The numbers are taken before chunk mapping.
cleaner_trigger is the trigger to wake up cleaner_kthread so autodefrag
can do its work.

Now here are the results:

                | Data bytes written | Diff to baseline
----------------+--------------------+------------------
 no autodefrag  |           36896768 |      0
 v5.15 vanilla  |           40079360 |  +8.6%
 v5.15 POC      |           42491904 | +15.2%
 v5.16 fixes    |           42536960 | +15.3%

The data shows that, although vanilla v5.15 really causes the least
amount of IO for autodefrag, once v5.15 is patched with this POC to do a
proper defrag, the final IO is almost the same as v5.16 with the
submitted fixes.
So this proves that the lower IO of v5.15 is not a valid baseline, but
the result of a regression which leads to a less efficient defrag.

And the IO increase is in fact proof of a regression being fixed.

Qu Wenruo (2):
  btrfs: defrag: don't defrag preallocated extents
  btrfs: defrag: limit cluster size to the first hole/prealloc range

 fs/btrfs/ioctl.c | 48 ++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 42 insertions(+), 6 deletions(-)

-- 
2.34.1

^ permalink raw reply	[flat|nested] 18+ messages in thread
* [POC for v5.15 1/2] btrfs: defrag: don't defrag preallocated extents
  2022-01-25 6:50 [POC for v5.15 0/2] btrfs: defrag: what if v5.15 is doing proper defrag Qu Wenruo
@ 2022-01-25 6:50 ` Qu Wenruo
  2022-01-25 6:50 ` [POC for v5.15 2/2] btrfs: defrag: limit cluster size to the first hole/prealloc range Qu Wenruo
  2022-01-25 10:37 ` [POC for v5.15 0/2] btrfs: defrag: what if v5.15 is doing proper defrag Filipe Manana
  2 siblings, 0 replies; 18+ messages in thread
From: Qu Wenruo @ 2022-01-25 6:50 UTC (permalink / raw)
To: linux-btrfs

This is the same patch as "btrfs: defrag: don't try to merge regular
extents with preallocated extents", but for v5.15.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ioctl.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0ae5777bf056..8350864a4bd8 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1152,19 +1152,23 @@ static bool defrag_check_next_extent(struct inode *inode, struct extent_map *em,
 				     bool locked)
 {
 	struct extent_map *next;
-	bool ret = true;
+	bool ret = false;
 
 	/* this is the last extent */
 	if (em->start + em->len >= i_size_read(inode))
-		return false;
+		return ret;
 
 	next = defrag_lookup_extent(inode, em->start + em->len, locked);
 	if (!next || next->block_start >= EXTENT_MAP_LAST_BYTE)
-		ret = false;
-	else if ((em->block_start + em->block_len == next->block_start) &&
-		 (em->block_len > SZ_128K && next->block_len > SZ_128K))
-		ret = false;
+		goto out;
+	if (test_bit(EXTENT_FLAG_COMPRESSED, &next->flags))
+		goto out;
+	if ((em->block_start + em->block_len == next->block_start) &&
+	    (em->block_len > SZ_128K && next->block_len > SZ_128K))
+		goto out;
 
+	ret = true;
+out:
 	free_extent_map(next);
 	return ret;
 }
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread
* [POC for v5.15 2/2] btrfs: defrag: limit cluster size to the first hole/prealloc range
  2022-01-25 6:50 [POC for v5.15 0/2] btrfs: defrag: what if v5.15 is doing proper defrag Qu Wenruo
  2022-01-25 6:50 ` [POC for v5.15 1/2] btrfs: defrag: don't defrag preallocated extents Qu Wenruo
@ 2022-01-25 6:50 ` Qu Wenruo
  2022-01-25 10:37 ` [POC for v5.15 0/2] btrfs: defrag: what if v5.15 is doing proper defrag Filipe Manana
  2 siblings, 0 replies; 18+ messages in thread
From: Qu Wenruo @ 2022-01-25 6:50 UTC (permalink / raw)
To: linux-btrfs

We always use a 256K cluster size if possible, but it's very common that
there are only several small regular extents which are targets for
defrag.

Those extents do not fill the whole 256K cluster, and the holes after
them will make defrag reject the whole cluster.

This patch mitigates the problem by doing another search for the first
hole/preallocated range, and limiting the size of the cluster to just
the regular extents part.

With this, defrag can do more of its work without being interrupted by a
hole.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ioctl.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 8350864a4bd8..b66bb10e2e4a 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1481,6 +1481,29 @@ static int cluster_pages_for_defrag(struct inode *inode,
 
 }
 
+static u64 get_cluster_size(struct inode *inode, u64 start, u64 len)
+{
+	u64 cur = start;
+	u64 cluster_len = 0;
+
+	while (cur < start + len) {
+		struct extent_map *em;
+
+		em = defrag_lookup_extent(inode, cur, false);
+		if (!em)
+			break;
+		/*
+		 * Here we don't do comprehensive checks, we just
+		 * find the first hole/preallocated.
+		 */
+		if (em->block_start >= EXTENT_MAP_LAST_BYTE ||
+		    test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
+			break;
+		cluster_len += min(em->start + em->len - cur, start + len - cur);
+		cur = min(em->start + em->len, start + len);
+	}
+	return cluster_len;
+}
+
 /*
  * Entry point to file defragmentation.
  *
@@ -1618,6 +1641,13 @@ int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
 		} else {
 			cluster = max_cluster;
 		}
+		cluster = min_t(unsigned long,
+				get_cluster_size(inode, i << PAGE_SHIFT,
+						 cluster << PAGE_SHIFT)
+					>> PAGE_SHIFT,
+				cluster);
+		if (cluster == 0)
+			goto next;
 
 		if (i + cluster > ra_index) {
 			ra_index = max(i, ra_index);
@@ -1644,6 +1674,7 @@ int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
 		balance_dirty_pages_ratelimited(inode->i_mapping);
 		btrfs_inode_unlock(inode, 0);
 
+next:
 		if (newer_than) {
 			if (newer_off == (u64)-1)
 				break;
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread
* Re: [POC for v5.15 0/2] btrfs: defrag: what if v5.15 is doing proper defrag
  2022-01-25 6:50 [POC for v5.15 0/2] btrfs: defrag: what if v5.15 is doing proper defrag Qu Wenruo
  2022-01-25 6:50 ` [POC for v5.15 1/2] btrfs: defrag: don't defrag preallocated extents Qu Wenruo
  2022-01-25 6:50 ` [POC for v5.15 2/2] btrfs: defrag: limit cluster size to the first hole/prealloc range Qu Wenruo
@ 2022-01-25 10:37 ` Filipe Manana
  2022-01-25 10:55   ` Qu Wenruo
  2 siblings, 1 reply; 18+ messages in thread
From: Filipe Manana @ 2022-01-25 10:37 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Jan 25, 2022 at 02:50:55PM +0800, Qu Wenruo wrote:
> ** DON'T MERGE, THIS IS JUST A PROOF OF CONCEPT **
>
> There are several reports about v5.16 btrfs autodefrag is causing more
> IO than v5.15.
>
> But it turns out that, commit f458a3873ae ("btrfs: fix race when
> defragmenting leads to unnecessary IO") is making defrags doing less
> work than it should.
> Thus damping the IO for autodefrag.
>
> This POC series is to make v5.15 kernel to do proper defrag of all good
> candidates while still not defrag any hole/preallocated range.
>
> The test script here looks like this:
>
> 	wipefs -fa $dev
> 	mkfs.btrfs -f $dev -U $uuid > /dev/null
> 	mount $dev $mnt -o autodefrag
> 	$fsstress -w -n 2000 -p 1 -d $mnt -s 1642319517
> 	sync
> 	echo "=== baseline ==="
> 	cat /sys/fs/btrfs/$uuid/debug/io_accounting/data_write
> 	echo 0 > /sys/fs/btrfs/$uuid/debug/cleaner_trigger
> 	sleep 3
> 	sync
> 	echo "=== after autodefrag ==="
> 	cat /sys/fs/btrfs/$uuid/debug/io_accounting/data_write
> 	umount $mnt
>
> <uuid>/debug/io_accounting/data_write is the new debug features showing
> how many bytes has been written for a btrfs.
> The numbers are before chunk mapping.
> cleaer_trigger is the trigger to wake up cleaner_kthread so autodefrag
> can do its work.
>
> Now there is result:
>
>                 | Data bytes written | Diff to baseline
> ----------------+--------------------+------------------
>  no autodefrag  |           36896768 |      0
>  v5.15 vanilla  |           40079360 |  +8.6%
>  v5.15 POC      |           42491904 | +15.2%
>  v5.16 fixes    |           42536960 | +15.3%
>
> The data shows, although v5.15 vanilla is really causing the least
> amount of IO for autodefrag, if v5.15 is patched with POC to do proper
> defrag, the final IO is almost the same as v5.16 with submitted fixes.
>
> So this proves that, the v5.15 has lower IO is not a valid default, but
> a regression which leads to less efficient defrag.
>
> And the IO increase is in fact a proof of a regression being fixed.

Are you sure that's the only thing?

Users report massive IO difference, 15% more does not seem to be massive.
François for example reported a difference of 10 ops/s vs 1k ops/s [1]

It also does not explain the 100% cpu usage of the cleaner kthread.
Scanning the whole file based on extent maps and not using
btrfs_search_forward() anymore, as discussed yesterday on slack, can
however contribute to much higher cpu usage.

[1] https://lore.kernel.org/linux-btrfs/CAEwRaO4y3PPPUdwYjNDoB9m9CLzfd3DFFk2iK1X6OyyEWG5-mg@mail.gmail.com/

Thanks.

>
> Qu Wenruo (2):
>   btrfs: defrag: don't defrag preallocated extents
>   btrfs: defrag: limit cluster size to the first hole/prealloc range
>
>  fs/btrfs/ioctl.c | 48 ++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 42 insertions(+), 6 deletions(-)
>
> -- 
> 2.34.1
>

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [POC for v5.15 0/2] btrfs: defrag: what if v5.15 is doing proper defrag
  2022-01-25 10:37 ` [POC for v5.15 0/2] btrfs: defrag: what if v5.15 is doing proper defrag Filipe Manana
@ 2022-01-25 10:55   ` Qu Wenruo
  2022-01-25 11:05     ` Qu Wenruo
  0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2022-01-25 10:55 UTC (permalink / raw)
To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs

On 2022/1/25 18:37, Filipe Manana wrote:
> On Tue, Jan 25, 2022 at 02:50:55PM +0800, Qu Wenruo wrote:
>> ** DON'T MERGE, THIS IS JUST A PROOF OF CONCEPT **
>>
>> There are several reports about v5.16 btrfs autodefrag is causing more
>> IO than v5.15.
>>
>> But it turns out that, commit f458a3873ae ("btrfs: fix race when
>> defragmenting leads to unnecessary IO") is making defrags doing less
>> work than it should.
>> Thus damping the IO for autodefrag.
>>
>> This POC series is to make v5.15 kernel to do proper defrag of all good
>> candidates while still not defrag any hole/preallocated range.
>>
>> The test script here looks like this:
>>
>> 	wipefs -fa $dev
>> 	mkfs.btrfs -f $dev -U $uuid > /dev/null
>> 	mount $dev $mnt -o autodefrag
>> 	$fsstress -w -n 2000 -p 1 -d $mnt -s 1642319517
>> 	sync
>> 	echo "=== baseline ==="
>> 	cat /sys/fs/btrfs/$uuid/debug/io_accounting/data_write
>> 	echo 0 > /sys/fs/btrfs/$uuid/debug/cleaner_trigger
>> 	sleep 3
>> 	sync
>> 	echo "=== after autodefrag ==="
>> 	cat /sys/fs/btrfs/$uuid/debug/io_accounting/data_write
>> 	umount $mnt
>>
>> <uuid>/debug/io_accounting/data_write is the new debug features showing
>> how many bytes has been written for a btrfs.
>> The numbers are before chunk mapping.
>> cleaer_trigger is the trigger to wake up cleaner_kthread so autodefrag
>> can do its work.
>>
>> Now there is result:
>>
>>                 | Data bytes written | Diff to baseline
>> ----------------+--------------------+------------------
>>  no autodefrag  |           36896768 |      0
>>  v5.15 vanilla  |           40079360 |  +8.6%
>>  v5.15 POC      |           42491904 | +15.2%
>>  v5.16 fixes    |           42536960 | +15.3%
>>
>> The data shows, although v5.15 vanilla is really causing the least
>> amount of IO for autodefrag, if v5.15 is patched with POC to do proper
>> defrag, the final IO is almost the same as v5.16 with submitted fixes.
>>
>> So this proves that, the v5.15 has lower IO is not a valid default, but
>> a regression which leads to less efficient defrag.
>>
>> And the IO increase is in fact a proof of a regression being fixed.
>
> Are you sure that's the only thing?

Not the only thing, but it proves that the v5.15 baseline is not a
reliable one.

> Users report massive IO difference, 15% more does not seem to be massive.
> François for example reported a difference of 10 ops/s vs 1k ops/s [1]

This is just for the seed I'm using, as it provides a reliable and
constant baseline which all my previous testing and debugging are based
on.

It can definitely be way more IO if the load involves more full-cluster
rejection.

>
> It also does not explain the 100% cpu usage of the cleaner kthread.
> Scanning the whole file based on extent maps and not using
> btrfs_search_forward() anymore, as discussed yesterday on slack, can
> however contribute to much higher cpu usage.

That's definitely one possible reason.

But this particular analysis has io_accounting focused on data_write,
so metadata can definitely play its part in it.

Nevertheless, this already shows that there are problems in the old
autodefrag path which were never really exposed.

Thanks,
Qu

>
> [1] https://lore.kernel.org/linux-btrfs/CAEwRaO4y3PPPUdwYjNDoB9m9CLzfd3DFFk2iK1X6OyyEWG5-mg@mail.gmail.com/
>
> Thanks.
>
>>
>> Qu Wenruo (2):
>>   btrfs: defrag: don't defrag preallocated extents
>>   btrfs: defrag: limit cluster size to the first hole/prealloc range
>>
>>  fs/btrfs/ioctl.c | 48 ++++++++++++++++++++++++++++++++++++++++++------
>>  1 file changed, 42 insertions(+), 6 deletions(-)
>>
>> -- 
>> 2.34.1
>>

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [POC for v5.15 0/2] btrfs: defrag: what if v5.15 is doing proper defrag
  2022-01-25 10:55   ` Qu Wenruo
@ 2022-01-25 11:05     ` Qu Wenruo
  2022-01-25 19:58       ` François-Xavier Thomas
  0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2022-01-25 11:05 UTC (permalink / raw)
To: Filipe Manana, Qu Wenruo, François-Xavier Thomas; +Cc: linux-btrfs

On 2022/1/25 18:55, Qu Wenruo wrote:
>
>
> On 2022/1/25 18:37, Filipe Manana wrote:
>> On Tue, Jan 25, 2022 at 02:50:55PM +0800, Qu Wenruo wrote:
>>> ** DON'T MERGE, THIS IS JUST A PROOF OF CONCEPT **
>>>
>>> There are several reports about v5.16 btrfs autodefrag is causing more
>>> IO than v5.15.
>>>
>>> But it turns out that, commit f458a3873ae ("btrfs: fix race when
>>> defragmenting leads to unnecessary IO") is making defrags doing less
>>> work than it should.
>>> Thus damping the IO for autodefrag.
>>>
>>> This POC series is to make v5.15 kernel to do proper defrag of all good
>>> candidates while still not defrag any hole/preallocated range.
>>>
>>> The test script here looks like this:
>>>
>>> 	wipefs -fa $dev
>>> 	mkfs.btrfs -f $dev -U $uuid > /dev/null
>>> 	mount $dev $mnt -o autodefrag
>>> 	$fsstress -w -n 2000 -p 1 -d $mnt -s 1642319517
>>> 	sync
>>> 	echo "=== baseline ==="
>>> 	cat /sys/fs/btrfs/$uuid/debug/io_accounting/data_write
>>> 	echo 0 > /sys/fs/btrfs/$uuid/debug/cleaner_trigger
>>> 	sleep 3
>>> 	sync
>>> 	echo "=== after autodefrag ==="
>>> 	cat /sys/fs/btrfs/$uuid/debug/io_accounting/data_write
>>> 	umount $mnt
>>>
>>> <uuid>/debug/io_accounting/data_write is the new debug features showing
>>> how many bytes has been written for a btrfs.
>>> The numbers are before chunk mapping.
>>> cleaer_trigger is the trigger to wake up cleaner_kthread so autodefrag
>>> can do its work.
>>>
>>> Now there is result:
>>>
>>>                 | Data bytes written | Diff to baseline
>>> ----------------+--------------------+------------------
>>>  no autodefrag  |           36896768 |      0
>>>  v5.15 vanilla  |           40079360 |  +8.6%
>>>  v5.15 POC      |           42491904 | +15.2%
>>>  v5.16 fixes    |           42536960 | +15.3%
>>>
>>> The data shows, although v5.15 vanilla is really causing the least
>>> amount of IO for autodefrag, if v5.15 is patched with POC to do proper
>>> defrag, the final IO is almost the same as v5.16 with submitted fixes.
>>>
>>> So this proves that, the v5.15 has lower IO is not a valid default, but
>>> a regression which leads to less efficient defrag.
>>>
>>> And the IO increase is in fact a proof of a regression being fixed.
>>
>> Are you sure that's the only thing?
>
> Not the only thing, but it's proving the baseline of v5.15 is not a
> reliable one.
>
>> Users report massive IO difference, 15% more does not seem to be massive.
>> François for example reported a difference of 10 ops/s vs 1k ops/s [1]
>
> This is just for the seed I'm using.
> As it provides a reliable and constant baseline where all my previous
> testing and debugging are based on.
>
> It can be definitely way more IO if the load involves more full cluster
> rejection.
>
>>
>> It also does not explain the 100% cpu usage of the cleaner kthread.
>> Scanning the whole file based on extent maps and not using
>> btrfs_search_forward() anymore, as discussed yesterday on slack, can
>> however contribute to much higher cpu usage.
>
> That's definitely one possible reason.
>
> But this particular analyse has io_accounting focused on data_write,
> thus metadata can definitely have its part in it.

BTW, with the io_accounting debugging feature, we can easily distinguish
metadata read/write from the data read/write caused by fsstress and
autodefrag.

Although in my environment the nr_ops are pretty small, so the whole
metadata can easily be cached in memory, and the btrfs_search_forward()
problem is not that observable here.

For a real-world proof, François, would you mind testing v5.15 with
these two patches applied, to see if there is a difference in IO
compared to vanilla v5.15?

Thanks,
Qu

>
> Nevertheless, this has already shows there are problems in the old
> autodefrag path and not really exposed.
>
> Thanks,
> Qu
>
>>
>> [1]
>> https://lore.kernel.org/linux-btrfs/CAEwRaO4y3PPPUdwYjNDoB9m9CLzfd3DFFk2iK1X6OyyEWG5-mg@mail.gmail.com/
>>
>> Thanks.
>>
>>>
>>> Qu Wenruo (2):
>>>   btrfs: defrag: don't defrag preallocated extents
>>>   btrfs: defrag: limit cluster size to the first hole/prealloc range
>>>
>>>  fs/btrfs/ioctl.c | 48 ++++++++++++++++++++++++++++++++++++++++++------
>>>  1 file changed, 42 insertions(+), 6 deletions(-)
>>>
>>> -- 
>>> 2.34.1
>>>

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [POC for v5.15 0/2] btrfs: defrag: what if v5.15 is doing proper defrag
  2022-01-25 11:05     ` Qu Wenruo
@ 2022-01-25 19:58       ` François-Xavier Thomas
  2022-02-01 21:18         ` François-Xavier Thomas
  0 siblings, 1 reply; 18+ messages in thread
From: François-Xavier Thomas @ 2022-01-25 19:58 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Filipe Manana, Qu Wenruo, linux-btrfs

Hi Qu, all,

> For the real world proof, François, mind to test v5.15 with these two
> patches applied to see if there is a difference in IO, compared to v5.15
> vanilla?

Sure, I'll take the last 5.15 I used (5.15.13) as a base + the 2 patches
from this thread, is that good?
  1-btrfs-dont-defrag-preallocated-extents.patch
  2-defrag-limit-cluster-size-to-first-hole.patch

Will post results tomorrow when it finishes compiling.

François-Xavier

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [POC for v5.15 0/2] btrfs: defrag: what if v5.15 is doing proper defrag
  2022-01-25 19:58       ` François-Xavier Thomas
@ 2022-02-01 21:18         ` François-Xavier Thomas
  2022-02-02  0:35           ` Qu Wenruo
  0 siblings, 1 reply; 18+ messages in thread
From: François-Xavier Thomas @ 2022-02-01 21:18 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Filipe Manana, Qu Wenruo, linux-btrfs

> Sure, I'll take the last 5.15 I used (5.15.13) as a base + the 2
> patches from this thread, is that good?
>   1-btrfs-dont-defrag-preallocated-extents.patch
>   2-defrag-limit-cluster-size-to-first-hole.patch
>
> Will post results tomorrow when it finishes compiling.

Tomorrow ended up taking a week, as it happens, sorry for that. No
significant difference compared to before:
https://i.imgur.com/BNoxixL.png

I'm not sure what we can infer from this; if I understood correctly,
this should have shown similar amounts of I/O to 5.16 with all the
previous fixes, right?

I pasted the full diff I used at the end of this mail. Since
`defrag_lookup_extent` changed signature between 5.15 and 5.16, I had
to make a couple of minor changes to get it to compile on 5.15.13.
Hopefully that was what you initially intended on testing.

Thanks,
François-Xavier

----

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index cc61813213d8..07eb7417b6e7 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1119,19 +1119,23 @@ static struct extent_map *defrag_lookup_extent(struct inode *inode, u64 start)
 static bool defrag_check_next_extent(struct inode *inode, struct extent_map *em)
 {
 	struct extent_map *next;
-	bool ret = true;
+	bool ret = false;
 
 	/* this is the last extent */
 	if (em->start + em->len >= i_size_read(inode))
-		return false;
+		return ret;
 
 	next = defrag_lookup_extent(inode, em->start + em->len);
 	if (!next || next->block_start >= EXTENT_MAP_LAST_BYTE)
-		ret = false;
-	else if ((em->block_start + em->block_len == next->block_start) &&
-		 (em->block_len > SZ_128K && next->block_len > SZ_128K))
-		ret = false;
+		goto out;
+	if (test_bit(EXTENT_FLAG_COMPRESSED, &next->flags))
+		goto out;
+	if ((em->block_start + em->block_len == next->block_start) &&
+	    (em->block_len > SZ_128K && next->block_len > SZ_128K))
+		goto out;
 
+	ret = true;
+out:
 	free_extent_map(next);
 	return ret;
 }
@@ -1403,6 +1407,30 @@ static int cluster_pages_for_defrag(struct inode *inode,
 
 }
 
+static u64 get_cluster_size(struct inode *inode, u64 start, u64 len)
+{
+	u64 cur = start;
+	u64 cluster_len = 0;
+
+	while (cur < start + len) {
+		struct extent_map *em;
+
+		em = defrag_lookup_extent(inode, cur);
+		if (!em)
+			break;
+		/*
+		 * Here we don't do comprehensive checks, we just
+		 * find the first hole/preallocated.
+		 */
+		if (em->block_start >= EXTENT_MAP_LAST_BYTE ||
+		    test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
+			break;
+		cluster_len += min(em->start + em->len - cur, start + len - cur);
+		cur = min(em->start + em->len, start + len);
+	}
+	return cluster_len;
+}
+
+
 int btrfs_defrag_file(struct inode *inode, struct file *file,
		      struct btrfs_ioctl_defrag_range_args *range,
		      u64 newer_than, unsigned long max_to_defrag)
@@ -1531,6 +1559,13 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 		} else {
 			cluster = max_cluster;
 		}
+		cluster = min_t(unsigned long,
+				get_cluster_size(inode, i << PAGE_SHIFT,
+						 cluster << PAGE_SHIFT)
+					>> PAGE_SHIFT,
+				cluster);
+		if (cluster == 0)
+			goto next;
 
 		if (i + cluster > ra_index) {
 			ra_index = max(i, ra_index);
@@ -1557,6 +1592,7 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 		balance_dirty_pages_ratelimited(inode->i_mapping);
 		btrfs_inode_unlock(inode, 0);
 
+next:
 		if (newer_than) {
 			if (newer_off == (u64)-1)
 				break;

^ permalink raw reply related	[flat|nested] 18+ messages in thread
* Re: [POC for v5.15 0/2] btrfs: defrag: what if v5.15 is doing proper defrag
  2022-02-01 21:18         ` François-Xavier Thomas
@ 2022-02-02  0:35           ` Qu Wenruo
  2022-02-02  1:18             ` Qu Wenruo
  0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2022-02-02  0:35 UTC (permalink / raw)
To: François-Xavier Thomas; +Cc: Filipe Manana, Qu Wenruo, linux-btrfs

On 2022/2/2 05:18, François-Xavier Thomas wrote:
>> Sure, I'll take the last 5.15 I used (5.15.13) as a base + the 2
>> patches from this thread, is that good?
>>   1-btrfs-dont-defrag-preallocated-extents.patch
>>   2-defrag-limit-cluster-size-to-first-hole.patch
>>
>> Will post results tomorrow when it finishes compiling.
>
> Tomorrow ended up taking a week, as it happens, sorry for that. No
> significant difference compared to before:
> https://i.imgur.com/BNoxixL.png
>
> I'm not sure what we can infer from this, if I understood correctly
> this should have shown similar amounts of I/O to 5.16 with all the
> previous fixes, right?

OK, things are getting more complex now.

Unless your workload includes quite a few preallocated ranges, this is
definitely not what I expected.

As a final attempt, would you mind testing the small diff below (on its
own, without any of the previous diffs) against v5.15?

Thanks,
Qu

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index cc61813213d8..bcc075ea56c9 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1528,8 +1528,6 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 			cluster = (PAGE_ALIGN(defrag_end) >>
 				   PAGE_SHIFT) - i;
 			cluster = min(cluster, max_cluster);
-		} else {
-			cluster = max_cluster;
 		}
 
 		if (i + cluster > ra_index) {

>
> I pasted the full diff I used at the end of this mail, since
> `defrag_lookup_extent` changed signature between 5.15 and 5.16 I had
> to do a couple of minor changes to get it to compile on 5.15.13.
> Hopefully that was what you initially intended on testing.
>
> Thanks,
> François-Xavier
>
> ----
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index cc61813213d8..07eb7417b6e7 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -1119,19 +1119,23 @@ static struct extent_map *defrag_lookup_extent(struct inode *inode, u64 start)
>  static bool defrag_check_next_extent(struct inode *inode, struct extent_map *em)
>  {
>  	struct extent_map *next;
> -	bool ret = true;
> +	bool ret = false;
> 
>  	/* this is the last extent */
>  	if (em->start + em->len >= i_size_read(inode))
> -		return false;
> +		return ret;
> 
>  	next = defrag_lookup_extent(inode, em->start + em->len);
>  	if (!next || next->block_start >= EXTENT_MAP_LAST_BYTE)
> -		ret = false;
> -	else if ((em->block_start + em->block_len == next->block_start) &&
> -		 (em->block_len > SZ_128K && next->block_len > SZ_128K))
> -		ret = false;
> +		goto out;
> +	if (test_bit(EXTENT_FLAG_COMPRESSED, &next->flags))
> +		goto out;
> +	if ((em->block_start + em->block_len == next->block_start) &&
> +	    (em->block_len > SZ_128K && next->block_len > SZ_128K))
> +		goto out;
> 
> +	ret = true;
> +out:
>  	free_extent_map(next);
>  	return ret;
>  }
> @@ -1403,6 +1407,30 @@ static int cluster_pages_for_defrag(struct inode *inode,
> 
>  }
> 
> +static u64 get_cluster_size(struct inode *inode, u64 start, u64 len)
> +{
> +	u64 cur = start;
> +	u64 cluster_len = 0;
> +
> +	while (cur < start + len) {
> +		struct extent_map *em;
> +
> +		em = defrag_lookup_extent(inode, cur);
> +		if (!em)
> +			break;
> +		/*
> +		 * Here we don't do comprehensive checks, we just
> +		 * find the first hole/preallocated.
> +		 */
> +		if (em->block_start >= EXTENT_MAP_LAST_BYTE ||
> +		    test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
> +			break;
> +		cluster_len += min(em->start + em->len - cur, start + len - cur);
> +		cur = min(em->start + em->len, start + len);
> +	}
> +	return cluster_len;
> +}
> +
> +
>  int btrfs_defrag_file(struct inode *inode, struct file *file,
> 		      struct btrfs_ioctl_defrag_range_args *range,
> 		      u64 newer_than, unsigned long max_to_defrag)
> @@ -1531,6 +1559,13 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
>  		} else {
>  			cluster = max_cluster;
>  		}
> +		cluster = min_t(unsigned long,
> +				get_cluster_size(inode, i << PAGE_SHIFT,
> +						 cluster << PAGE_SHIFT)
> +					>> PAGE_SHIFT,
> +				cluster);
> +		if (cluster == 0)
> +			goto next;
> 
>  		if (i + cluster > ra_index) {
>  			ra_index = max(i, ra_index);
> @@ -1557,6 +1592,7 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
>  		balance_dirty_pages_ratelimited(inode->i_mapping);
>  		btrfs_inode_unlock(inode, 0);
> 
> +next:
>  		if (newer_than) {
>  			if (newer_off == (u64)-1)
>  				break;

^ permalink raw reply related	[flat|nested] 18+ messages in thread
* Re: [POC for v5.15 0/2] btrfs: defrag: what if v5.15 is doing proper defrag
  2022-02-02  0:35           ` Qu Wenruo
@ 2022-02-02  1:18             ` Qu Wenruo
  2022-02-02 19:01               ` François-Xavier Thomas
  0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2022-02-02  1:18 UTC (permalink / raw)
To: François-Xavier Thomas; +Cc: Filipe Manana, Qu Wenruo, linux-btrfs

On 2022/2/2 08:35, Qu Wenruo wrote:
>
>
> On 2022/2/2 05:18, François-Xavier Thomas wrote:
>>> Sure, I'll take the last 5.15 I used (5.15.13) as a base + the 2
>>> patches from this thread, is that good?
>>>   1-btrfs-dont-defrag-preallocated-extents.patch
>>>   2-defrag-limit-cluster-size-to-first-hole.patch
>>>
>>> Will post results tomorrow when it finishes compiling.
>>
>> Tomorrow ended up taking a week, as it happens, sorry for that. No
>> significant difference compared to before:
>> https://i.imgur.com/BNoxixL.png
>>
>> I'm not sure what we can infer from this, if I understood correctly
>> this should have shown similar amounts of I/O to 5.16 with all the
>> previous fixes, right?
>
> OK, things are getting more complex now.
>
> Unless your workload includes quite some preallocated range, it's
> definitely not what I expected.
>
> As a final struggle, mind to test the small diff (without any previous
> diff) against v5.15?
>
> Thanks,
> Qu
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index cc61813213d8..bcc075ea56c9 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -1528,8 +1528,6 @@ int btrfs_defrag_file(struct inode *inode, struct
> file *file,
>  			cluster = (PAGE_ALIGN(defrag_end) >>
>  				   PAGE_SHIFT) - i;
>  			cluster = min(cluster, max_cluster);
> -		} else {
> -			cluster = max_cluster;
>  		}
> 
>  		if (i + cluster > ra_index) {

Wait, this is not correct. The correct one is this:

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index cc61813213d8..f6f2468d4883 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1524,13 +1524,8 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 			continue;
 		}
 
-		if (!newer_than) {
-			cluster = (PAGE_ALIGN(defrag_end) >>
-				   PAGE_SHIFT) - i;
-			cluster = min(cluster, max_cluster);
-		} else {
-			cluster = max_cluster;
-		}
+		cluster = (PAGE_ALIGN(defrag_end) >> PAGE_SHIFT) - i;
+		cluster = min(cluster, max_cluster);
 
 		if (i + cluster > ra_index) {
 			ra_index = max(i, ra_index);

>
>> I pasted the full diff I used at the end of this mail, since
>> `defrag_lookup_extent` changed signature between 5.15 and 5.16 I had
>> to do a couple of minor changes to get it to compile on 5.15.13.
>> Hopefully that was what you initially intended on testing.
>>
>> Thanks,
>> François-Xavier
>>
>> ----
>>
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index cc61813213d8..07eb7417b6e7 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -1119,19 +1119,23 @@ static struct extent_map *defrag_lookup_extent(struct inode *inode, u64 start)
>>  static bool defrag_check_next_extent(struct inode *inode, struct extent_map *em)
>>  {
>>  	struct extent_map *next;
>> -	bool ret = true;
>> +	bool ret = false;
>> 
>>  	/* this is the last extent */
>>  	if (em->start + em->len >= i_size_read(inode))
>> -		return false;
>> +		return ret;
>> 
>>  	next = defrag_lookup_extent(inode, em->start + em->len);
>>  	if (!next || next->block_start >= EXTENT_MAP_LAST_BYTE)
>> -		ret = false;
>> -	else if ((em->block_start + em->block_len == next->block_start) &&
>> -		 (em->block_len > SZ_128K && next->block_len > SZ_128K))
>> -		ret = false;
>> +		goto out;
>> +	if (test_bit(EXTENT_FLAG_COMPRESSED, &next->flags))
>> +		goto out;
>> +	if ((em->block_start + em->block_len == next->block_start) &&
>> +	    (em->block_len > SZ_128K && next->block_len > SZ_128K))
>> +		goto out;
>> 
>> +	ret = true;
>> +out:
>>  	free_extent_map(next);
>>  	return ret;
>>  }
>> @@ -1403,6 +1407,30 @@ static int cluster_pages_for_defrag(struct inode *inode,
>> 
>>  }
>> 
>> +static u64 get_cluster_size(struct inode *inode, u64 start, u64 len)
>> +{
>> +	u64 cur = start;
>> +	u64 cluster_len = 0;
>> +
>> +	while (cur < start + len) {
>> +		struct extent_map *em;
>> +
>> +		em = defrag_lookup_extent(inode, cur);
>> +		if (!em)
>> +			break;
>> +		/*
>> +		 * Here we don't do comprehensive checks, we just
>> +		 * find the first hole/preallocated.
>> +		 */
>> +		if (em->block_start >= EXTENT_MAP_LAST_BYTE ||
>> +		    test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
>> +			break;
>> +		cluster_len += min(em->start + em->len - cur, start + len - cur);
>> +		cur = min(em->start + em->len, start + len);
>> +	}
>> +	return cluster_len;
>> +}
>> +
>> +
>>  int btrfs_defrag_file(struct inode *inode, struct file *file,
>> 		      struct btrfs_ioctl_defrag_range_args *range,
>> 		      u64 newer_than, unsigned long max_to_defrag)
>> @@ -1531,6 +1559,13 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
>>  		} else {
>>  			cluster = max_cluster;
>>  		}
>> +		cluster = min_t(unsigned long,
>> +				get_cluster_size(inode, i << PAGE_SHIFT,
>> +						 cluster << PAGE_SHIFT)
>> +					>> PAGE_SHIFT,
>> +				cluster);
>> +		if (cluster == 0)
>> +			goto next;
>> 
>>  		if (i + cluster > ra_index) {
>>  			ra_index = max(i, ra_index);
>> @@ -1557,6 +1592,7 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
>>  		balance_dirty_pages_ratelimited(inode->i_mapping);
>>  		btrfs_inode_unlock(inode, 0);
>> 
>> +next:
>>  		if (newer_than) {
>>  			if (newer_off == (u64)-1)
>>  				break;

^ permalink raw reply related	[flat|nested] 18+ messages in thread
* Re: [POC for v5.15 0/2] btrfs: defrag: what if v5.15 is doing proper defrag
  2022-02-02  1:18 ` Qu Wenruo
@ 2022-02-02 19:01   ` François-Xavier Thomas
  2022-02-04  9:32     ` François-Xavier Thomas
  0 siblings, 1 reply; 18+ messages in thread
From: François-Xavier Thomas @ 2022-02-02 19:01 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Filipe Manana, Qu Wenruo, linux-btrfs

> As a final struggle, mind testing the small diff (without any of the previous diffs) against v5.15?

Sure, will let you know when I get the results!

François-Xavier
* Re: [POC for v5.15 0/2] btrfs: defrag: what if v5.15 is doing proper defrag
  2022-02-02 19:01 ` François-Xavier Thomas
@ 2022-02-04  9:32   ` François-Xavier Thomas
  2022-02-04  9:49     ` Qu Wenruo
  0 siblings, 1 reply; 18+ messages in thread
From: François-Xavier Thomas @ 2022-02-04 9:32 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Filipe Manana, Qu Wenruo, linux-btrfs

The increase in I/O is small but noticeable:
https://i.imgur.com/IJ4lLI2.png

I also checked 5.16.5 just in case, and the difference is still marked.

François-Xavier
* Re: [POC for v5.15 0/2] btrfs: defrag: what if v5.15 is doing proper defrag
  2022-02-04  9:32 ` François-Xavier Thomas
@ 2022-02-04  9:49   ` Qu Wenruo
  2022-02-04 11:05     ` François-Xavier Thomas
  0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2022-02-04 9:49 UTC (permalink / raw)
To: François-Xavier Thomas, Qu Wenruo; +Cc: Filipe Manana, linux-btrfs

On 2022/2/4 17:32, François-Xavier Thomas wrote:
> The increase in I/O is small but noticeable:
> https://i.imgur.com/IJ4lLI2.png
>
> I also checked 5.16.5 just in case, and the difference is still marked.
>
> François-Xavier

OK, then it means there is still something else involved.

But one thing to mention: the color scheme is a little weird to me.
Although the legend shows dark yellow (?) as read bytes/s, in the graph
it looks more like pink. The same applies to dark green (?), which looks
more like regular green for sda write bytes/s.

So what I got from the graph is:

- v5.15.13 vanilla
  Less than 3 MB/s write for autodefrag, with burst reads around 15 MB/s.

- v5.15.13 + final patch
  Approaching 5 MB/s write, and near 20 MB/s burst reads.
  I believe this should be our baseline.

- v5.16.5 vanilla
  10 MB/s write, near 35 MB/s burst reads.

For the write part, I don't really have a better idea yet.

But for the read part, I'm working to bring back two optimizations.

One is to skip large ranges which are not our defrag targets.
That patchset has been sent to the mailing list:
https://patchwork.kernel.org/project/linux-btrfs/list/?series=611214

The other one is to reduce the metadata IO and memory usage for extent
map searching. In theory it should reduce the metadata IO, but I'm not
yet confident enough.

Will give you an update when the testing branch is fully prepared.

Thanks,
Qu
* Re: [POC for v5.15 0/2] btrfs: defrag: what if v5.15 is doing proper defrag
  2022-02-04  9:49 ` Qu Wenruo
@ 2022-02-04 11:05   ` François-Xavier Thomas
  2022-02-04 11:21     ` Qu Wenruo
  2022-02-05 10:19     ` Qu Wenruo
  0 siblings, 2 replies; 18+ messages in thread
From: François-Xavier Thomas @ 2022-02-04 11:05 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Qu Wenruo, Filipe Manana, linux-btrfs

> But one thing to mention: the color scheme is a little weird to me.

Good point, here's a custom graph with just the write rate, which
should be easier to read than the default graph with everything on it:
https://i.imgur.com/vlRPOFr.png

Waiting for your next update then. In the meantime, are there other
statistics I should collect that would make this easier to debug?

Thanks,
François-Xavier
* Re: [POC for v5.15 0/2] btrfs: defrag: what if v5.15 is doing proper defrag
  2022-02-04 11:05 ` François-Xavier Thomas
@ 2022-02-04 11:21   ` Qu Wenruo
  2022-02-04 11:27     ` François-Xavier Thomas
  0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2022-02-04 11:21 UTC (permalink / raw)
To: François-Xavier Thomas, Qu Wenruo; +Cc: Filipe Manana, linux-btrfs

On 2022/2/4 19:05, François-Xavier Thomas wrote:
>> But one thing to mention: the color scheme is a little weird to me.
>
> Good point, here's a custom graph with just the write rate, which
> should be easier to read than the default graph with everything on it:
> https://i.imgur.com/vlRPOFr.png

This is indeed way better.

Although the legend now only shows write MB/s, I still see some pink
lines. Do the pink lines still belong to the sda write MB/s? Or are
they just the read MB/s?

And if possible, read MB/s would also help.

> Waiting for your next update then. In the meantime, are there other
> statistics I should collect that would make this easier to debug?

Other debug data can be too large and slow down the system, like ftrace
events for each cluster.

Another thing which could greatly help is sorting the IO by its METADATA
flag (e.g. providing metadata/data read/write MB/s separately).

But unfortunately I don't have any idea how to monitor that without
using eBPF...

Thanks,
Qu
* Re: [POC for v5.15 0/2] btrfs: defrag: what if v5.15 is doing proper defrag
  2022-02-04 11:21 ` Qu Wenruo
@ 2022-02-04 11:27   ` François-Xavier Thomas
  2022-02-04 11:28     ` François-Xavier Thomas
  0 siblings, 1 reply; 18+ messages in thread
From: François-Xavier Thomas @ 2022-02-04 11:27 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Qu Wenruo, Filipe Manana, linux-btrfs

> Although the legend now only shows write MB/s, I still see some pink lines.

Zabbix shows slightly different colors for statistics when the timeline
is zoomed out far enough: reddish pink is the maximum, light green is
the minimum; the legend shows the average. That was indeed a little
muddy in the previous graphs!

> read MB/s would also help.

Here's a set of separate graphs with read/write ops/bytes (no real
difference in reads):
https://imgur.com/1Zp47zj

François-Xavier
* Re: [POC for v5.15 0/2] btrfs: defrag: what if v5.15 is doing proper defrag
  2022-02-04 11:27 ` François-Xavier Thomas
@ 2022-02-04 11:28   ` François-Xavier Thomas
  0 siblings, 0 replies; 18+ messages in thread
From: François-Xavier Thomas @ 2022-02-04 11:28 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Qu Wenruo, Filipe Manana, linux-btrfs

> Here's a set of separate graphs with read/write ops/bytes (no real difference in reads):

Ah, that was the wrong link, here's the correct one:
https://imgur.com/a/nB4dDmF

François-Xavier
* Re: [POC for v5.15 0/2] btrfs: defrag: what if v5.15 is doing proper defrag
  2022-02-04 11:05 ` François-Xavier Thomas
  2022-02-04 11:21   ` Qu Wenruo
@ 2022-02-05 10:19   ` Qu Wenruo
  1 sibling, 0 replies; 18+ messages in thread
From: Qu Wenruo @ 2022-02-05 10:19 UTC (permalink / raw)
To: François-Xavier Thomas, Qu Wenruo; +Cc: Filipe Manana, linux-btrfs

On 2022/2/4 19:05, François-Xavier Thomas wrote:
>> But one thing to mention: the color scheme is a little weird to me.
>
> Good point, here's a custom graph with just the write rate, which
> should be easier to read than the default graph with everything on it:
> https://i.imgur.com/vlRPOFr.png
>
> Waiting for your next update then. In the meantime, are there other
> statistics I should collect that would make this easier to debug?

Here is my branch:
https://github.com/adam900710/linux/tree/autodefrag_fixes

This branch contains the following fixes which are not yet in the
misc-next branch:

("btrfs: defrag: bring back the old file extent search behavior")

This addresses several problems exposed by Filipe, all related to the
btrfs_get_extent() call which gets heavily used in v5.16.

Unfortunately, I don't yet have a good idea of how to craft a pinpoint
test case for it, thus I'm not that confident whether this is the last
piece.

("btrfs: populate extent_map::generation when reading from disk")

This problem has been there since at least v5.15. Thankfully it doesn't
affect the autodefrag behavior yet.
("btrfs: defrag: allow defrag_one_cluster() to skip large extent which is not a target")
("btrfs: defrag: make btrfs_defrag_file() to report accurate number of defragged sectors")
("btrfs: defrag: use btrfs_defrag_ctrl to replace btrfs_ioctl_defrag_range_args for btrfs_defrag_file()")
("btrfs: defrag: introduce btrfs_defrag_ctrl structure for later usage")
("btrfs: uapi: introduce BTRFS_DEFRAG_RANGE_MASK for later sanity check")

This series has been sent to the list, and is mostly a pure optimization
to skip large extents faster; it should not change the data write IO
though.

("btrfs: defrag: remove an ambiguous condition for rejection")
("btrfs: defrag: don't defrag extents which is already at its max capacity")
("btrfs: defrag: don't try to merge regular extents with preallocated extents")

This series has been sent to the list, and is to address the old bad
behavior around preallocated extents.

The base is the misc-next branch from David, which is in turn based on
v5.17-rc2.

Thanks,
Qu

> Thanks,
> François-Xavier
end of thread, other threads:[~2022-02-05 10:19 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-25  6:50 [POC for v5.15 0/2] btrfs: defrag: what if v5.15 is doing proper defrag Qu Wenruo
2022-01-25  6:50 ` [POC for v5.15 1/2] btrfs: defrag: don't defrag preallocated extents Qu Wenruo
2022-01-25  6:50 ` [POC for v5.15 2/2] btrfs: defrag: limit cluster size to the first hole/prealloc range Qu Wenruo
2022-01-25 10:37 ` [POC for v5.15 0/2] btrfs: defrag: what if v5.15 is doing proper defrag Filipe Manana
2022-01-25 10:55   ` Qu Wenruo
2022-01-25 11:05     ` Qu Wenruo
2022-01-25 19:58       ` François-Xavier Thomas
2022-02-01 21:18         ` François-Xavier Thomas
2022-02-02  0:35           ` Qu Wenruo
2022-02-02  1:18           ` Qu Wenruo
2022-02-02 19:01             ` François-Xavier Thomas
2022-02-04  9:32               ` François-Xavier Thomas
2022-02-04  9:49                 ` Qu Wenruo
2022-02-04 11:05                   ` François-Xavier Thomas
2022-02-04 11:21                     ` Qu Wenruo
2022-02-04 11:27                       ` François-Xavier Thomas
2022-02-04 11:28                         ` François-Xavier Thomas
2022-02-05 10:19                     ` Qu Wenruo