From: Sergei Trofimovich <slyich@gmail.com>
To: Anand Jain <anand.jain@oracle.com>
Cc: linux-btrfs@vger.kernel.org, David Sterba <dsterba@suse.com>,
Boris Burkov <boris@bur.io>, Chris Mason <clm@fb.com>,
Josef Bacik <josef@toxicpanda.com>
Subject: Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
Date: Thu, 2 Mar 2023 10:54:06 +0000 [thread overview]
Message-ID: <20230302105406.2cd367f7@nz> (raw)
In-Reply-To: <94cf49d0-fa2d-cc2c-240e-222706d69eb3@oracle.com>
On Thu, 2 Mar 2023 17:12:27 +0800
Anand Jain <anand.jain@oracle.com> wrote:
> On 3/2/23 03:30, Sergei Trofimovich wrote:
> > Hi btrfs maintainers!
> >
> > Tl;DR:
> >
> > After 63a7cb13071842 "btrfs: auto enable discard=async when possible" I
> > see a constant DISCARD storm towards my NVMe device, be it idle or not.
> >
> > No storm: v6.1 and older
> > Has storm: v6.2 and newer
> >
> > More words:
> >
> > After the upgrade from 6.1 to 6.2 I noticed that the disk LED on my
> > desktop started flashing incessantly, regardless of workload.
> >
> > I think I confirmed the storm with `perf`: the LED flashes align with
> > the output of:
> >
> > # perf ftrace -a -T 'nvme_setup*' | cat
> >
> > kworker/6:1H-298 [006] 2569.645201: nvme_setup_cmd <-nvme_queue_rq
> > kworker/6:1H-298 [006] 2569.645205: nvme_setup_discard <-nvme_setup_cmd
> > kworker/6:1H-298 [006] 2569.749198: nvme_setup_cmd <-nvme_queue_rq
> > kworker/6:1H-298 [006] 2569.749202: nvme_setup_discard <-nvme_setup_cmd
> > kworker/6:1H-298 [006] 2569.853204: nvme_setup_cmd <-nvme_queue_rq
> > kworker/6:1H-298 [006] 2569.853209: nvme_setup_discard <-nvme_setup_cmd
> > kworker/6:1H-298 [006] 2569.958198: nvme_setup_cmd <-nvme_queue_rq
> > kworker/6:1H-298 [006] 2569.958202: nvme_setup_discard <-nvme_setup_cmd
> >
> > `iotop` shows no read/write IO at all (expected).
> >
> > I was able to bisect it down to this commit:
> >
> > $ git bisect good
> > 63a7cb13071842966c1ce931edacbc23573aada5 is the first bad commit
> > commit 63a7cb13071842966c1ce931edacbc23573aada5
> > Author: David Sterba <dsterba@suse.com>
> > Date: Tue Jul 26 20:54:10 2022 +0200
> >
> > btrfs: auto enable discard=async when possible
> >
> > There's a request to automatically enable async discard for capable
> > devices. We can do that, the async mode is designed to wait for larger
> > freed extents and is not intrusive, with limits to iops, kbps or latency.
> >
> > The status and tunables will be exported in /sys/fs/btrfs/FSID/discard .
> >
> > The automatic selection is done if there's at least one discard capable
> > device in the filesystem (not capable devices are skipped). Mounting
> > with any other discard option will honor that option, notably mounting
> > with nodiscard will keep it disabled.
> >
> > Link: https://lore.kernel.org/linux-btrfs/CAEg-Je_b1YtdsCR0zS5XZ_SbvJgN70ezwvRwLiCZgDGLbeMB=w@mail.gmail.com/
> > Reviewed-by: Boris Burkov <boris@bur.io>
> > Signed-off-by: David Sterba <dsterba@suse.com>
> >
> > fs/btrfs/ctree.h | 1 +
> > fs/btrfs/disk-io.c | 14 ++++++++++++++
> > fs/btrfs/super.c | 2 ++
> > fs/btrfs/volumes.c | 3 +++
> > fs/btrfs/volumes.h | 2 ++
> > 5 files changed, 22 insertions(+)
> >
> > Is this storm a known issue? I did not dig too much into the patch. But
> > glancing at it this bit looks slightly off:
> >
> > + if (bdev_max_discard_sectors(bdev))
> > + fs_devices->discardable = true;
> >
> > Is it expected that there is no `= false` assignment?
> >
> > This is the list of `btrfs` filesystems I have:
> >
> > $ cat /proc/mounts | fgrep btrfs
> > /dev/nvme0n1p3 / btrfs rw,noatime,compress=zstd:3,ssd,space_cache,subvolid=848,subvol=/nixos 0 0
> > /dev/sda3 /mnt/archive btrfs rw,noatime,compress=zstd:3,space_cache,subvolid=5,subvol=/ 0 0
> > # skipped bind mounts
> >
>
>
>
> > The device is:
> >
> > $ lspci | fgrep -i Solid
> > 01:00.0 Non-Volatile memory controller: ADATA Technology Co., Ltd. XPG SX8200 Pro PCIe Gen3x4 M.2 2280 Solid State Drive (rev 03)
>
>
> It is an SSD device with an NVMe interface that needs regular discard.
> Why not try tuning the IO intensity using the
>
> /sys/fs/btrfs/<uuid>/discard
>
> options?
>
> Maybe not all discardable sectors are issued at once. It is a good
> idea to try with a fresh mkfs (which runs discard at mkfs time) to see
> if discard is still being issued even when there is no fs activity.
Ah, thank you Anand! I poked a bit more in `perf ftrace` and I think I
see a "slow" pass through the discard backlog:
/sys/fs/btrfs/<UUID>/discard$ cat iops_limit
10
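For reference, the other async-discard knobs sit next to iops_limit in the
same sysfs directory. A sketch of inspecting and raising the limit (knob
names as exported by the v6.2 sysfs interface; <UUID> stands for the
filesystem UUID):

```shell
# Inspect the async-discard tunables for a mounted btrfs filesystem.
# <UUID> is the filesystem UUID (see `btrfs filesystem show`).
cd /sys/fs/btrfs/<UUID>/discard

cat discardable_extents   # extents currently waiting to be discarded
cat discardable_bytes     # bytes currently waiting to be discarded
cat iops_limit            # discard requests per second budget
cat kbps_limit            # discard throughput budget (0 = unlimited)
cat max_discard_size      # upper bound on a single discard request

# Raising the iops budget should drain the backlog faster (as root):
echo 100 > iops_limit
```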
Twice a minute I get a short burst of file creates/deletes that produces
a bit of free space in many block groups. That enqueues hundreds of work
items.
$ sudo perf ftrace -a -T 'btrfs_discard_workfn' -T 'btrfs_issue_discard' -T 'btrfs_discard_queue_work'
btrfs-transacti-407 [011] 42800.424027: btrfs_discard_queue_work <-__btrfs_add_free_space
btrfs-transacti-407 [011] 42800.424070: btrfs_discard_queue_work <-__btrfs_add_free_space
...
btrfs-transacti-407 [011] 42800.425053: btrfs_discard_queue_work <-__btrfs_add_free_space
btrfs-transacti-407 [011] 42800.425055: btrfs_discard_queue_work <-__btrfs_add_free_space
193 entries of btrfs_discard_queue_work, all enqueued into the workqueue
within about 1 ms.
kworker/u64:1-2379115 [000] 42800.487010: btrfs_discard_workfn <-process_one_work
kworker/u64:1-2379115 [000] 42800.487028: btrfs_issue_discard <-btrfs_discard_extent
kworker/u64:1-2379115 [005] 42800.594010: btrfs_discard_workfn <-process_one_work
kworker/u64:1-2379115 [005] 42800.594031: btrfs_issue_discard <-btrfs_discard_extent
...
kworker/u64:15-2396822 [007] 42830.441487: btrfs_discard_workfn <-process_one_work
kworker/u64:15-2396822 [007] 42830.441502: btrfs_issue_discard <-btrfs_discard_extent
kworker/u64:15-2396822 [000] 42830.546497: btrfs_discard_workfn <-process_one_work
kworker/u64:15-2396822 [000] 42830.546524: btrfs_issue_discard <-btrfs_discard_extent
286 pairs of btrfs_discard_workfn / btrfs_issue_discard.
Consecutive pairs are about 100 ms apart, which matches iops_limit=10
(a 100 ms delay per discard). That means I can get about 10 discards
per second at most.
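Checking that rate against the trace timestamps (this assumes the worker
simply waits 1000/iops_limit milliseconds between requests, which is
consistent with the ~105 ms gaps between workfn entries above; it is my
back-of-envelope model, not code from the kernel):

```python
# Rough model of the async-discard drain rate under iops_limit=10.
# Assumption: the worker waits 1000 / iops_limit ms between discard
# requests (matches the ~105 ms gaps in the trace above).
iops_limit = 10
delay_ms = 1000 / iops_limit            # 100.0 ms between discards
rate_per_s = 1000 / delay_ms            # ~10 discards per second

backlog = 286                            # workfn/issue_discard pairs seen
drain_s = backlog * delay_ms / 1000      # ~28.6 s to drain the backlog

print(f"{delay_ms:.0f} ms/discard, {rate_per_s:.0f}/s, "
      f"{drain_s:.1f} s to drain {backlog} items")
```

With ~28.6 s needed to drain each batch and new free-space deltas arriving
with every ~30 s commit, the discard worker is busy almost continuously,
which would explain the steady LED blinking.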
btrfs-transacti-407 [002] 42830.634216: btrfs_discard_queue_work <-__btrfs_add_free_space
btrfs-transacti-407 [002] 42830.634228: btrfs_discard_queue_work <-__btrfs_add_free_space
...
The next transaction started 30 seconds later, which is the default
commit interval.
My filesystem is 512 GB. My guess is that I get about one discard entry
per block group on each transaction commit.
Does my system keep up with the scheduled discard backlog? Can I peek at
the workqueue size?
Is iops_limit=10 a reasonable default for discard=async? It feels like it
will not be enough for larger filesystems, even in this mostly idle state.
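To put a number on that, a sketch under two assumptions of mine (1 GiB
data block groups, and roughly one discard work item per block group per
commit; neither is measured here):

```python
# How iops_limit=10 scales with filesystem size, assuming 1 GiB block
# groups and ~1 discard work item per block group per 30 s commit.
# Both assumptions are illustrative, not measured.
def drain_time_s(fs_size_gib, iops_limit=10, bg_size_gib=1):
    block_groups = fs_size_gib // bg_size_gib   # ~1 work item each
    return block_groups / iops_limit            # seconds to drain one batch

commit_interval_s = 30
for size in (512, 2048, 8192):
    t = drain_time_s(size)
    print(f"{size:5} GiB: {t:6.1f} s per batch, "
          f"keeps up: {t <= commit_interval_s}")
```

Even at 512 GiB the worst case (one item per block group per commit)
already exceeds the 30-second budget, and larger filesystems fall further
behind.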
--
Sergei
Thread overview: 30+ messages
2023-03-01 19:30 [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async Sergei Trofimovich
2023-03-02 8:04 ` Linux regression tracking #adding (Thorsten Leemhuis)
2023-04-04 10:52 ` Linux regression tracking #update (Thorsten Leemhuis)
2023-04-21 13:56 ` Linux regression tracking #update (Thorsten Leemhuis)
2023-03-02 9:12 ` Anand Jain
2023-03-02 10:54 ` Sergei Trofimovich [this message]
2023-03-15 11:44 ` Linux regression tracking (Thorsten Leemhuis)
2023-03-15 16:34 ` Sergei Trofimovich
2023-03-20 22:40 Christopher Price
2023-03-21 21:26 ` Josef Bacik
2023-03-22 8:38 ` Christoph Hellwig
2023-03-23 22:26 ` Sergei Trofimovich
2023-04-04 10:49 ` Linux regression tracking (Thorsten Leemhuis)
2023-04-04 16:04 ` Christoph Hellwig
2023-04-04 16:20 ` Roman Mamedov
2023-04-04 16:27 ` Christoph Hellwig
2023-04-04 23:37 ` Damien Le Moal
2023-04-04 18:15 ` Chris Mason
2023-04-04 18:51 ` Boris Burkov
2023-04-04 19:22 ` David Sterba
2023-04-04 19:39 ` Boris Burkov
2023-04-05 8:17 ` Linux regression tracking (Thorsten Leemhuis)
2023-04-10 2:03 ` Michael Bromilow
2023-04-11 17:52 ` David Sterba
2023-04-11 18:15 ` Linux regression tracking (Thorsten Leemhuis)
2023-04-04 19:08 ` Sergei Trofimovich
2023-04-05 6:18 ` Christoph Hellwig
2023-04-05 12:01 ` Chris Mason
2023-04-04 18:23 ` Boris Burkov
2023-04-04 19:12 ` Sergei Trofimovich