Date: Tue, 4 Apr 2023 11:23:12 -0700
From: Boris Burkov
To: "Linux regression tracking (Thorsten Leemhuis)"
Cc: Sergei Trofimovich, Christoph Hellwig, Josef Bacik, Christopher Price,
	anand.jain@oracle.com, clm@fb.com, dsterba@suse.com,
	linux-btrfs@vger.kernel.org, regressions@lists.linux.dev
Subject: Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
Message-ID: <20230404182256.GA344341@zen>
References: <20230323222606.20d10de2@nz> <20d85dc4-b6c2-dac1-fdc6-94e44b43692a@leemhuis.info>
In-Reply-To: <20d85dc4-b6c2-dac1-fdc6-94e44b43692a@leemhuis.info>

On Tue, Apr 04, 2023 at 12:49:40PM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 23.03.23 23:26, Sergei Trofimovich wrote:
> > On Wed, 22 Mar 2023 01:38:42 -0700
> > Christoph Hellwig wrote:
> >
> >> On Tue, Mar 21, 2023 at 05:26:49PM -0400, Josef Bacik wrote:
> >>> We got the defaults based on our testing with our workloads inside of
> >>> FB. Clearly this isn't representative of normal desktop usage, but
> >>> we've also got a lot of workloads, so we figured that if it made the
> >>> whole fleet happy it would probably be fine everywhere.
> >>>
> >>> That being said, this is tunable for a reason; your workload seems to
> >>> generate a lot of freed extents and discards. We can probably mess
> >>> with the async stuff to maybe pause discarding if there's no other
> >>> activity happening on the device at the moment, but tuning it to let
> >>> more discards through at a time is also completely valid. Thanks,
>
> BTW, there is another report about this issue here:
> https://bugzilla.redhat.com/show_bug.cgi?id=2182228
>
> /me wonders if there is an openSUSE report as well, but a quick search
> didn't find one
>
> And as a fun fact, or for completeness, the issue even made it to reddit, too:
> https://www.reddit.com/r/archlinux/comments/121htxn/btrfs_discard_storm_on_62x_kernel/

Good find, but also:
https://www.reddit.com/r/Fedora/comments/vjfpkv/periodic_trim_freezes_ssd/

So without harder data, there is a bit of bias inherent in cherry-picking
negative impressions from the internet.

>
> >> FYI, discard performance differs a lot between different SSDs.
> >> It used to be pretty horrible for most devices early on, and then a
> >> certain hyperscaler started requiring decent performance for enterprise
> >> drives, so many of them are good now. A lot less so for the typical
> >> consumer drive, especially at the lower end of the spectrum.
> >>
> >> And that's just NVMe; the still-shipping SATA SSDs are another story
> >> entirely. Not helped by the fact that we don't even support ranged
> >> discards for them in Linux.
>
> Thx for your comments Christoph. Quick question, just to be sure I
> understand things properly:
>
> I assume on some of those problematic devices these discard storms will
> lead to a performance regression?
>
> I also heard people saying these discard storms might reduce the life
> time of some devices - is that true?
>
> If the answer to at least one of these is "yes", I'd say it might be
> best to revert 63a7cb130718 for now.
>
> > Josef, what did you use as a signal to detect what value was good
> > enough? Did you crank up the number until the discard backlog cleared
> > up in a reasonable time?

Josef is OOO, so I'll try to clarify some things around async discard;
hopefully it's helpful to anyone wondering how to tune it.

Like you guessed, our tuning basically consists of looking at the
discardable_extents/discardable_bytes metrics in the fleet and ensuring
they look sane, and that we see an improvement in I/O tail latencies or
fix some concrete instances of bad tail latencies. E.g., with
iops_limit=10, we see concrete cases of bad latency go away and don't see
a steady buildup of discardable_extents.

> >
> > I still don't understand what I should take into account to change the
> > default and whether I should change it at all. Is it fine if the discard
> > backlog takes a week to get through? (Or a day? An hour? A minute?)

I believe the relevant metrics are:

- number of trims issued / bytes trimmed (you would measure this by tracing
  and by looking at discard_extent_bytes and discard_bitmap_bytes)
- bytes "wrongly" trimmed (extents that were reallocated without getting
  trimmed show up in discard_bytes_saved, so if that drops, you may be
  trimming things you did not need to)
- discardable_extents/discardable_bytes (in sysfs; the outstanding work)
- tail latency of file system operations
- disk idle time

By doing periodic trim you trade off better bytes_saved and better disk
idle time (one big trim per week vs. "trim all the time") against worse
tail latency during the trim itself, plus the risk of trimming too
infrequently, leading to worse latency on a drive that needs a trim. A
quick way to read most of these counters from sysfs is sketched below.
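For anyone who wants to poke at this on their own machine, here is a
minimal sketch. It assumes the per-filesystem discard directory under
/sys/fs/btrfs/<fsid>/discard/ and uses only the counter names mentioned
above; the helper itself is made up for illustration, not a btrfs-progs
tool (tail latency and disk idle time you'd still get from your usual I/O
monitoring):

/*
 * Hypothetical helper, for illustration only: dump the async discard
 * counters named above from a filesystem's sysfs discard directory.
 *
 * Usage: ./discard-stats /sys/fs/btrfs/<fsid>/discard
 */
#include <stdio.h>

static void print_counter(const char *dir, const char *name)
{
    char path[512];
    char buf[64];
    FILE *f;

    /* Build the full sysfs path and read the single-line value. */
    snprintf(path, sizeof(path), "%s/%s", dir, name);
    f = fopen(path, "r");
    if (!f) {
        perror(path);
        return;
    }
    if (fgets(buf, sizeof(buf), f))
        printf("%-22s %s", name, buf);  /* sysfs value ends in '\n' */
    fclose(f);
}

int main(int argc, char **argv)
{
    /* Counter and tunable names discussed in this thread. */
    static const char * const names[] = {
        "discardable_extents",
        "discardable_bytes",
        "discard_extent_bytes",
        "discard_bitmap_bytes",
        "discard_bytes_saved",
        "iops_limit",
        "kbps_limit",
    };
    size_t i;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <sysfs discard dir>\n", argv[0]);
        return 1;
    }
    for (i = 0; i < sizeof(names) / sizeof(names[0]); i++)
        print_counter(argv[1], names[i]);
    return 0;
}

Running it on an interval (or before and after a burst of deletes) gives a
feel for how quickly the backlog builds up and how quickly the limiter
drains it.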
> >
> > Is it fine to send discards as fast as the device allows instead of
> > doing 10 IOPS? Does the IOPS limit account for a device-wear tradeoff?
> > Then a low IOPS limit makes sense. Or is the IOPS limit just a way to
> > reserve most bandwidth for non-discard workloads? Then I would say
> > unlimited IOPS as a default would make more sense for btrfs.

Unfortunately, btrfs currently doesn't have a "fully unlimited" async
discard, no matter how you tune it. Ignoring kbps_limit, which only serves
to increase the delay, iops_limit has an effective range between 1 and
1000.

The basic premise of btrfs async discard is to meter out the discards at a
steady rate, asynchronously from file system operations, so the effect of
the tunables is to set the delay between discard operations. The delay is
clamped between 1ms and 1000ms, so iops_limit > 1000 is the same as
iops_limit = 1000. iops_limit=0 does not lead to unmetered discards, but
rather hits a hardcoded case of metering them out over 6 hours. (No clue
why; I don't personally love that...) A rough model of this delay
selection is sketched at the end of this message.

Hope that's somewhat helpful,
Boris

>
> /me would be interested in answers to these questions as well
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
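For completeness, here is a rough userspace model of the delay selection
described above. It only restates the behaviour from this mail (iops_limit
maps to a per-discard delay clamped to [1ms, 1000ms]; iops_limit=0 spreads
the work over the hardcoded 6 hours), and how the 6-hour case is spread
across the outstanding extents here is a simplification for illustration,
not the actual fs/btrfs/discard.c code:

/*
 * Rough model of the btrfs async discard delay selection, restated from
 * the description in this mail. Illustration only, not the kernel code.
 */
#include <stdio.h>

#define MIN_DELAY_MSEC  1UL
#define MAX_DELAY_MSEC  1000UL
#define TARGET_MSEC     (6UL * 60 * 60 * 1000)  /* the hardcoded 6 hours */
#define MSEC_PER_SEC    1000UL

static unsigned long delay_msec(unsigned long iops_limit,
                                unsigned long discardable_extents)
{
    unsigned long delay;

    if (iops_limit)
        delay = MSEC_PER_SEC / iops_limit;          /* e.g. 10 iops -> 100ms */
    else if (discardable_extents)
        delay = TARGET_MSEC / discardable_extents;  /* spread backlog over ~6h */
    else
        return MAX_DELAY_MSEC;                      /* nothing to discard */

    /* Clamp to [1ms, 1000ms]: iops_limit > 1000 behaves like 1000. */
    if (delay < MIN_DELAY_MSEC)
        delay = MIN_DELAY_MSEC;
    if (delay > MAX_DELAY_MSEC)
        delay = MAX_DELAY_MSEC;
    return delay;
}

int main(void)
{
    printf("iops_limit=10,   1000 extents -> %lu ms between discards\n",
           delay_msec(10, 1000));
    printf("iops_limit=5000, 1000 extents -> %lu ms (same as 1000)\n",
           delay_msec(5000, 1000));
    printf("iops_limit=0,    1000 extents -> %lu ms (clamped)\n",
           delay_msec(0, 1000));
    return 0;
}

Plugging in a few values makes the effective range obvious: anything above
iops_limit=1000 collapses to the 1ms floor, and the iops_limit=0 case only
discards quickly once the backlog of outstanding extents is already huge.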