Re: fiemap is slow on btrfs on files with multiple extents

From: Filipe Manana <fdmanana@kernel.org>
To: Dominique MARTINET <dominique.martinet@atmark-techno.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>,
	Josef Bacik <josef@toxicpanda.com>, Chris Mason <clm@fb.com>,
	David Sterba <dsterba@suse.com>,
	linux-btrfs@vger.kernel.org, lkml <linux-kernel@vger.kernel.org>,
	Chen Liang-Chun <featherclc@gmail.com>,
	Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>,
	kernel@openvz.org, Yu Kuai <yukuai3@huawei.com>,
	"Theodore Ts'o" <tytso@mit.edu>
Subject: Re: fiemap is slow on btrfs on files with multiple extents
Date: Wed, 21 Sep 2022 10:00:37 +0100	[thread overview]
Message-ID: <CAL3q7H5cL+4W6SQApq=ZhkzffvZAR2cEWK0bduNun+OkFevk=g@mail.gmail.com> (raw)
In-Reply-To: <Yyq9lfH3AP8I/pwd@atmark-techno.com>

On Wed, Sep 21, 2022 at 8:30 AM Dominique MARTINET
<dominique.martinet@atmark-techno.com> wrote:
>
> Filipe Manana wrote on Thu, Sep 01, 2022 at 02:25:12PM +0100:
> > It took me a bit more than I expected, but here is the patchset to make fiemap
> > (and lseek) much more efficient on btrfs:
> >
> > https://lore.kernel.org/linux-btrfs/cover.1662022922.git.fdmanana@suse.com/
> >
> > And also available in this git branch:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/log/?h=lseek_fiemap_scalability
>
> Thanks a lot!
> Sorry for the slow reply, it took me a while to find time to get back to
> my test setup.
>
> There's still this weird behaviour that later calls to cp are slower
> than the first, but the improvement is so good that it doesn't matter
> quite as much -- I haven't been able to reproduce the rcu stalls in qemu
> so I can't say for sure but they probably won't be a problem anymore.
>
> From a quick look with perf record/report the difference still seems to
> stem from fiemap (time spent there goes from 4.13 to 45.20%), so there
> is still more processing once the file is (at least partially) in cache,
> but it has gotten much better.
>
>
> (tests run on a laptop so assume some inconsistency with thermal
> throttling etc)
>
> /mnt/t/t # compsize bigfile
> Processed 1 file, 194955 regular extents (199583 refs), 0 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL       15%      3.7G          23G          23G
> none       100%      477M         477M         514M
> zstd        14%      3.2G          23G          23G
> /mnt/t/t # time cp bigfile /dev/null
> real    0m 44.52s
> user    0m 0.49s
> sys     0m 32.91s
> /mnt/t/t # time cp bigfile /dev/null
> real    0m 46.81s
> user    0m 0.55s
> sys     0m 35.63s
> /mnt/t/t # time cp bigfile /dev/null
> real    1m 13.63s
> user    0m 0.55s
> sys     1m 1.89s
> /mnt/t/t # time cp bigfile /dev/null
> real    1m 13.44s
> user    0m 0.53s
> sys     1m 2.08s
>
>
> For comparison here's how it was on 6.0-rc2 your branch is based on:
> /mnt/t/t # time cp atde-test /dev/null
> real    0m 46.17s
> user    0m 0.60s
> sys     0m 33.21s
> /mnt/t/t # time cp atde-test /dev/null
> real    5m 35.92s
> user    0m 0.57s
> sys     5m 24.20s
>
>
>
> If you're curious the report blames set_extent_bit and
> clear_state_bit as follow; get_extent_skip_holes is completely gone; but
> I wouldn't necessarily say this needs much more time spent on it.

get_extent_skip_holes() no longer exists, so 0% of time spent there :)

Yes, I know. The reason you see so much time spent on
lock_extent_bits() is basically
because cp does too many fiemap calls with a very small extent buffer size.
I pointed that out here:

https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/

Making it use a larger buffer (say 500 or 1000 extents), would make it
a lot better.
But as I pointed out there, last year cp was changed to not use fiemap
to detect holes anymore,
now it uses lseek with SEEK_HOLE mode. So with time, everyone will get
a cp version that does
not use fiemap anymore.

Also, for the cp case, since it does many read and fiemap calls to the
source file, the following
patch probably helps too:

https://lore.kernel.org/linux-btrfs/20220819024408.9714-1-ethanlien@synology.com/

Because it will make the io tree smaller. That should land on 6.1 too.

Thanks for testing and the report.

>
> 45.20%--extent_fiemap
> |
> |--31.02%--lock_extent_bits
> |          |
> |           --30.78%--set_extent_bit
> |                     |
> |                     |--6.93%--insert_state
> |                     |          |
> |                     |           --0.70%--set_state_bits
> |                     |
> |                     |--4.25%--alloc_extent_state
> |                     |          |
> |                     |           --3.86%--kmem_cache_alloc
> |                     |
> |                     |--2.77%--_raw_spin_lock
> |                     |          |
> |                     |           --1.23%--preempt_count_add
> |                     |
> |                     |--2.48%--rb_next
> |                     |
> |                     |--1.13%--_raw_spin_unlock
> |                     |          |
> |                     |           --0.55%--preempt_count_sub
> |                     |
> |                      --0.92%--set_state_bits
> |
>  --13.80%--__clear_extent_bit
>            |
>             --13.30%--clear_state_bit
>                       |
>                       |           --3.48%--_raw_spin_unlock_irqrestore
>                       |
>                       |--2.45%--merge_state.part.0
>                       |          |
>                       |           --1.57%--rb_next
>                       |
>                       |--2.14%--__slab_free
>                       |          |
>                       |           --1.26%--cmpxchg_double_slab.constprop.0.isra.0
>                       |
>                       |--0.74%--free_extent_state
>                       |
>                       |--0.70%--kmem_cache_free
>                       |
>                       |--0.69%--btrfs_clear_delalloc_extent
>                       |
>                        --0.52%--rb_next
>
>
>
> Thanks!
> --
> Dominique