From: Sergei Trofimovich <slyich@gmail.com>
To: "Linux regression tracking (Thorsten Leemhuis)"
	<regressions@leemhuis.info>
Cc: Linux regressions mailing list <regressions@lists.linux.dev>,
	Anand Jain <anand.jain@oracle.com>,
	linux-btrfs@vger.kernel.org, David Sterba <dsterba@suse.com>,
	Boris Burkov <boris@bur.io>, Chris Mason <clm@fb.com>,
	Josef Bacik <josef@toxicpanda.com>
Subject: Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
Date: Wed, 15 Mar 2023 16:34:57 +0000	[thread overview]
Message-ID: <20230315163457.35bb3b75@nz> (raw)
In-Reply-To: <5f0b44bb-e06e-bc47-b688-d9cfb5b490d3@leemhuis.info>

On Wed, 15 Mar 2023 12:44:34 +0100
"Linux regression tracking (Thorsten Leemhuis)" <regressions@leemhuis.info> wrote:

> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
> for once, to make this easily accessible to everyone.
> 
> I added this to the tracking, but it seems nothing happened for nearly
> two weeks.
> 
> Sergei, did you find a workaround that works for you? Or is this
> something that other people might run into as well, and that should
> therefore be fixed?

I used the workaround of cranking up IOPS of discard from 10 to 1000:

    # echo 1000 > /sys/fs/btrfs/<UUID>/discard/iops_limit

But I am not sure whether that is a safe or reasonable fix. I would prefer
someone from btrfs to comment on whether this is the kernel's expected behaviour.
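
For what it's worth, the commit message quoted below says that mounting with
any explicit discard option is honoured, so remounting with nodiscard should
disable async discard entirely (untested on my side):

    # mount -o remount,nodiscard /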

> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
> 
> #regzbot poke
> 
> On 02.03.23 11:54, Sergei Trofimovich wrote:
> > On Thu, 2 Mar 2023 17:12:27 +0800
> > Anand Jain <anand.jain@oracle.com> wrote:
> >   
> >> On 3/2/23 03:30, Sergei Trofimovich wrote:  
> >>> Hi btrfs maintainers!
> >>>
> >>> TL;DR:
> >>>
> >>>    After 63a7cb13071842 "btrfs: auto enable discard=async when possible" I
> >>>    see a constant DISCARD storm towards my NVMe device, be it idle or not.
> >>>
> >>>    No storm: v6.1 and older
> >>>    Has storm: v6.2 and newer
> >>>
> >>> More words:
> >>>
> >>> After upgrading from 6.1 to 6.2 I noticed that the disk LED on my desktop
> >>> started flashing incessantly, with or without a workload present.
> >>>
> >>> I think I confirmed the storm with `perf`: the LED flashes align with the
> >>> output of:
> >>>
> >>>      # perf ftrace -a -T 'nvme_setup*' | cat
> >>>
> >>>      kworker/6:1H-298     [006]   2569.645201: nvme_setup_cmd <-nvme_queue_rq
> >>>      kworker/6:1H-298     [006]   2569.645205: nvme_setup_discard <-nvme_setup_cmd
> >>>      kworker/6:1H-298     [006]   2569.749198: nvme_setup_cmd <-nvme_queue_rq
> >>>      kworker/6:1H-298     [006]   2569.749202: nvme_setup_discard <-nvme_setup_cmd
> >>>      kworker/6:1H-298     [006]   2569.853204: nvme_setup_cmd <-nvme_queue_rq
> >>>      kworker/6:1H-298     [006]   2569.853209: nvme_setup_discard <-nvme_setup_cmd
> >>>      kworker/6:1H-298     [006]   2569.958198: nvme_setup_cmd <-nvme_queue_rq
> >>>      kworker/6:1H-298     [006]   2569.958202: nvme_setup_discard <-nvme_setup_cmd
> >>>
> >>> `iotop` shows no read/write IO at all (expected).
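> >>>
> >>> Another check, assuming a kernel new enough to expose discard counters
> >>> in /proc/diskstats (fields 15-18): the discards-completed count should
> >>> keep growing even while the system is idle:
> >>>
> >>>      $ awk '$3 == "nvme0n1" { print "discards:", $15, "sectors:", $17 }' /proc/diskstats
> >>>      $ sleep 10
> >>>      $ awk '$3 == "nvme0n1" { print "discards:", $15, "sectors:", $17 }' /proc/diskstats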
> >>>
> >>> I was able to bisect it down to this commit:
> >>>
> >>>    $ git bisect good
> >>>    63a7cb13071842966c1ce931edacbc23573aada5 is the first bad commit
> >>>    commit 63a7cb13071842966c1ce931edacbc23573aada5
> >>>    Author: David Sterba <dsterba@suse.com>
> >>>    Date:   Tue Jul 26 20:54:10 2022 +0200
> >>>
> >>>      btrfs: auto enable discard=async when possible
> >>>
> >>>      There's a request to automatically enable async discard for capable
> >>>      devices. We can do that, the async mode is designed to wait for larger
> >>>      freed extents and is not intrusive, with limits to iops, kbps or latency.
> >>>
> >>>      The status and tunables will be exported in /sys/fs/btrfs/FSID/discard .
> >>>
> >>>      The automatic selection is done if there's at least one discard capable
> >>>      device in the filesystem (not capable devices are skipped). Mounting
> >>>      with any other discard option will honor that option, notably mounting
> >>>      with nodiscard will keep it disabled.
> >>>
> >>>      Link: https://lore.kernel.org/linux-btrfs/CAEg-Je_b1YtdsCR0zS5XZ_SbvJgN70ezwvRwLiCZgDGLbeMB=w@mail.gmail.com/
> >>>      Reviewed-by: Boris Burkov <boris@bur.io>
> >>>      Signed-off-by: David Sterba <dsterba@suse.com>
> >>>
> >>>     fs/btrfs/ctree.h   |  1 +
> >>>     fs/btrfs/disk-io.c | 14 ++++++++++++++
> >>>     fs/btrfs/super.c   |  2 ++
> >>>     fs/btrfs/volumes.c |  3 +++
> >>>     fs/btrfs/volumes.h |  2 ++
> >>>     5 files changed, 22 insertions(+)
> >>>
> >>> Is this storm a known issue? I did not dig too deeply into the patch, but
> >>> glancing at it, this bit looks slightly off:
> >>>
> >>>      +       if (bdev_max_discard_sectors(bdev))
> >>>      +               fs_devices->discardable = true;
> >>>
> >>> Is it expected that there is no `= false` assignment?
> >>>
> >>> This is the list of `btrfs` filesystems I have:
> >>>
> >>>    $ cat /proc/mounts | fgrep btrfs
> >>>    /dev/nvme0n1p3 / btrfs rw,noatime,compress=zstd:3,ssd,space_cache,subvolid=848,subvol=/nixos 0 0
> >>>    /dev/sda3 /mnt/archive btrfs rw,noatime,compress=zstd:3,space_cache,subvolid=5,subvol=/ 0 0
> >>>    # skipped bind mounts
> >>>     
> >>> The device is:
> >>>
> >>>    $ lspci | fgrep -i Solid
> >>>    01:00.0 Non-Volatile memory controller: ADATA Technology Co., Ltd. XPG SX8200 Pro PCIe Gen3x4 M.2 2280 Solid State Drive (rev 03)    
> >>
> >>
> >>   It is an SSD device with an NVMe interface, which needs regular discard.
> >>   Why not try tuning the IO intensity using the
> >>
> >>   /sys/fs/btrfs/<uuid>/discard
> >>
> >>   options?
> >>
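> >>   For example, the knobs there can be read and adjusted like this (a
> >>   sketch; <uuid> is the filesystem UUID, and 100 is an arbitrary value):
> >>
> >>      # cat /sys/fs/btrfs/<uuid>/discard/iops_limit
> >>      # cat /sys/fs/btrfs/<uuid>/discard/kbps_limit
> >>      # cat /sys/fs/btrfs/<uuid>/discard/max_discard_size
> >>      # echo 100 > /sys/fs/btrfs/<uuid>/discard/iops_limit
> >>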
> >>   Maybe not all discardable sectors are issued at once. It is a good
> >>   idea to try with a fresh mkfs (which runs discard at mkfs time) to see
> >>   whether discard is still being issued even when there is no fs activity.
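> >>
> >>   Something like this on a scratch device (destructive; device and mount
> >>   point are placeholders):
> >>
> >>      # mkfs.btrfs -f /dev/<scratch>
> >>      # mount /dev/<scratch> /mnt/test
> >>      # perf ftrace -a -T 'nvme_setup_discard' | cat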
> > 
> > Ah, thank you Anand! I poked a bit more with `perf ftrace` and I think I
> > see a "slow" pass through the discard backlog:
> > 
> >     /sys/fs/btrfs/<UUID>/discard$  cat iops_limit
> >     10
> > 
> > Twice a minute I get a short burst of file creates/deletes that produces
> > a bit of free space in many block groups. That enqueues hundreds of work
> > items.
> > 
> >     $ sudo perf ftrace -a -T 'btrfs_discard_workfn' -T 'btrfs_issue_discard' -T 'btrfs_discard_queue_work'
> >      btrfs-transacti-407     [011]  42800.424027: btrfs_discard_queue_work <-__btrfs_add_free_space
> >      btrfs-transacti-407     [011]  42800.424070: btrfs_discard_queue_work <-__btrfs_add_free_space
> >      ...
> >      btrfs-transacti-407     [011]  42800.425053: btrfs_discard_queue_work <-__btrfs_add_free_space
> >      btrfs-transacti-407     [011]  42800.425055: btrfs_discard_queue_work <-__btrfs_add_free_space
> > 
> > 193 entries of btrfs_discard_queue_work.
> > It took 1ms to enqueue all of the work into the workqueue.
> >     
> >      kworker/u64:1-2379115 [000]  42800.487010: btrfs_discard_workfn <-process_one_work
> >      kworker/u64:1-2379115 [000]  42800.487028: btrfs_issue_discard <-btrfs_discard_extent
> >      kworker/u64:1-2379115 [005]  42800.594010: btrfs_discard_workfn <-process_one_work
> >      kworker/u64:1-2379115 [005]  42800.594031: btrfs_issue_discard <-btrfs_discard_extent
> >      ...
> >      kworker/u64:15-2396822 [007]  42830.441487: btrfs_discard_workfn <-process_one_work
> >      kworker/u64:15-2396822 [007]  42830.441502: btrfs_issue_discard <-btrfs_discard_extent
> >      kworker/u64:15-2396822 [000]  42830.546497: btrfs_discard_workfn <-process_one_work
> >      kworker/u64:15-2396822 [000]  42830.546524: btrfs_issue_discard <-btrfs_discard_extent
> > 
> > 286 pairs of btrfs_discard_workfn / btrfs_issue_discard.
> > Each pair takes ~100ms to process, which matches iops_limit=10 (10
> > discards per second). That means I can get at most about 300 discards
> > per 30-second commit interval.
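> >
> > (Arithmetic: 1/iops_limit = 1/10 s = ~100ms between discards, matching
> > the ~107ms gaps in the trace above; 193..286 items * 100ms = ~19..29s
> > of near-continuous discard IO out of every 30-second window.)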
> > 
> >      btrfs-transacti-407     [002]  42830.634216: btrfs_discard_queue_work <-__btrfs_add_free_space
> >      btrfs-transacti-407     [002]  42830.634228: btrfs_discard_queue_work <-__btrfs_add_free_space
> >      ...
> > 
> > The next transaction started 30 seconds later, which is the default
> > commit interval.
> > 
> > My file system is 512GB in size. My guess is that I get about one
> > discard entry per block group on each transaction commit.
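> >
> > If data block groups are the usual 1GiB, a rough estimate (assuming my
> > one-entry-per-block-group guess holds):
> >
> >     512GB / 1GiB per block group = ~512 block groups
> >     512 entries / 10 per second  = ~51s for a full pass, i.e. longer
> >     than one 30-second commit interval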
> > 
> > Does my system keep up with the scheduled discard backlog? Can I peek
> > at the workqueue size?
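> >
> > The closest thing I can find to peeking at the backlog (assuming I am
> > reading the sysfs interface right) are the counters next to iops_limit:
> >
> >     $ cat /sys/fs/btrfs/<UUID>/discard/discardable_extents
> >     $ cat /sys/fs/btrfs/<UUID>/discard/discardable_bytes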
> > 
> > Is iops_limit=10 a reasonable default for discard=async? It feels like
> > for larger file systems it will not be enough even in this idle state.
> >   


-- 

  Sergei

