* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
@ 2023-03-20 22:40 Christopher Price
  2023-03-21 21:26 ` Josef Bacik
  0 siblings, 1 reply; 27+ messages in thread

From: Christopher Price @ 2023-03-20 22:40 UTC (permalink / raw)
To: slyich
Cc: anand.jain, boris, clm, dsterba, josef, linux-btrfs, regressions, regressions

I can also confirm the issue occurred on and after kernel 6.2, and
confirm that the workaround of setting the btrfs discard iops_limit works
for my device:

08:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd
NVMe SSD Controller PM9A1/PM9A3/980PRO

I had to use ~6000 for my device, instead of 1000.
I'm currently on kernel 6.2.1.

I am also interested to know if this is expected.

^ permalink raw reply	[flat|nested] 27+ messages in thread
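[Editor's note: for readers wanting to try the workaround above, the iops_limit knob lives in sysfs under the filesystem UUID. A minimal sketch; the mount point is an example, and 6000 is just the value that worked for this reporter's drive, not a general recommendation:

```shell
# Locate the UUID of the mounted btrfs filesystem (mount point "/" is an example).
UUID=$(findmnt -no UUID /)

# Show the current async-discard rate limit (discard requests per second).
cat "/sys/fs/btrfs/$UUID/discard/iops_limit"

# Raise the limit so the queued discards drain faster; needs root.
echo 6000 | sudo tee "/sys/fs/btrfs/$UUID/discard/iops_limit"
```

The setting does not persist across reboots; a udev rule or boot script is needed to make it stick.]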
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
  2023-03-20 22:40 [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async Christopher Price
@ 2023-03-21 21:26 ` Josef Bacik
  2023-03-22  8:38   ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread

From: Josef Bacik @ 2023-03-21 21:26 UTC (permalink / raw)
To: Christopher Price
Cc: slyich, anand.jain, boris, clm, dsterba, linux-btrfs, regressions, regressions

On Mon, Mar 20, 2023 at 6:41 PM Christopher Price <pricechrispy@gmail.com> wrote:
>
> I can also confirm the issue occurred on and after kernel 6.2, and
> confirm that the workaround of setting the btrfs discard iops_limit
> works for my device:
>
> 08:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd
> NVMe SSD Controller PM9A1/PM9A3/980PRO
>
> I had to use ~6000 for my device, instead of 1000.
> I'm currently on kernel 6.2.1.
>
> I am also interested to know if this is expected.

We got the defaults based on our testing with our workloads inside of
FB.  Clearly this isn't representative of normal desktop usage, but
we've also got a lot of workloads, so we figured if it made the whole
fleet happy it would probably be fine everywhere.

That being said, this is tunable for a reason; your workload seems to
generate a lot of freed extents and discards.  We can probably adjust
the async code to pause discarding if there's no other activity
happening on the device at the moment, but tuning it to let more
discards through at a time is also completely valid.  Thanks,

Josef
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
  2023-03-21 21:26 ` Josef Bacik
@ 2023-03-22  8:38   ` Christoph Hellwig
  2023-03-23 22:26     ` Sergei Trofimovich
  0 siblings, 1 reply; 27+ messages in thread

From: Christoph Hellwig @ 2023-03-22 8:38 UTC (permalink / raw)
To: Josef Bacik
Cc: Christopher Price, slyich, anand.jain, boris, clm, dsterba, linux-btrfs, regressions, regressions

On Tue, Mar 21, 2023 at 05:26:49PM -0400, Josef Bacik wrote:
> We got the defaults based on our testing with our workloads inside of
> FB.  Clearly this isn't representative of normal desktop usage, but
> we've also got a lot of workloads, so we figured if it made the whole
> fleet happy it would probably be fine everywhere.
>
> That being said, this is tunable for a reason; your workload seems to
> generate a lot of freed extents and discards.  We can probably adjust
> the async code to pause discarding if there's no other activity
> happening on the device at the moment, but tuning it to let more
> discards through at a time is also completely valid.  Thanks,

FYI, discard performance differs a lot between different SSDs.
It used to be pretty horrible for most devices early on, and then a
certain hyperscaler started requiring decent performance for enterprise
drives, so many of them are good now.  A lot less so for the typical
consumer drive, especially at the lower end of the spectrum.

And that's just NVMe; the still-shipping SATA SSDs are a different
story again, not helped by the fact that we don't even support ranged
discards for them in Linux.
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
  2023-03-22  8:38   ` Christoph Hellwig
@ 2023-03-23 22:26     ` Sergei Trofimovich
  2023-04-04 10:49       ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 1 reply; 27+ messages in thread

From: Sergei Trofimovich @ 2023-03-23 22:26 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Josef Bacik, Christopher Price, anand.jain, boris, clm, dsterba, linux-btrfs, regressions, regressions

On Wed, 22 Mar 2023 01:38:42 -0700
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Mar 21, 2023 at 05:26:49PM -0400, Josef Bacik wrote:
> > We got the defaults based on our testing with our workloads inside of
> > FB.  Clearly this isn't representative of normal desktop usage, but
> > we've also got a lot of workloads, so we figured if it made the whole
> > fleet happy it would probably be fine everywhere.
> >
> > That being said, this is tunable for a reason; your workload seems to
> > generate a lot of freed extents and discards.  We can probably adjust
> > the async code to pause discarding if there's no other activity
> > happening on the device at the moment, but tuning it to let more
> > discards through at a time is also completely valid.  Thanks,
>
> FYI, discard performance differs a lot between different SSDs.
> It used to be pretty horrible for most devices early on, and then a
> certain hyperscaler started requiring decent performance for enterprise
> drives, so many of them are good now.  A lot less so for the typical
> consumer drive, especially at the lower end of the spectrum.
>
> And that's just NVMe; the still-shipping SATA SSDs are a different
> story again, not helped by the fact that we don't even support ranged
> discards for them in Linux.

Josef, what did you use as a signal to decide what value was good
enough?  Did you crank up the number until the discard backlog cleared
in a reasonable time?

I still don't understand what I should take into account to change the
default, or whether I should change it at all.  Is it fine if it takes a
week to get through the discard backlog?  (Or a day?  An hour?  A minute?)

Is it fine to send discards as fast as the device allows instead of doing
10 IOPS?  Does the IOPS limit account for a device-wear tradeoff?  Then a
low IOPS limit makes sense.  Or is the IOPS limit just a way to reserve
most of the bandwidth for non-discard workloads?  Then I would say
unlimited IOPS would make more sense as a default for btrfs.

--

  Sergei
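[Editor's note: to put Sergei's "a week? a day? a minute?" question in concrete terms, here is a back-of-envelope sketch. The backlog size of one million extents is hypothetical; only the 10 IOPS default and the ~6000 tuned value come from this thread:

```shell
# Estimated hours to drain a discard backlog of N requests at a fixed
# rate of iops_limit discard requests per second.
drain_hours() {
    awk -v n="$1" -v iops="$2" 'BEGIN { printf "%.2f\n", n / iops / 3600 }'
}

drain_hours 1000000 10      # default limit:  prints 27.78 (over a day)
drain_hours 1000000 6000    # tuned limit:    prints 0.05  (a few minutes)
```

So at the shipped default, even a moderate backlog is metered out over many hours of low-level background I/O, which is exactly the "storm on idle" being reported.]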
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
  2023-03-23 22:26     ` Sergei Trofimovich
@ 2023-04-04 10:49       ` Linux regression tracking (Thorsten Leemhuis)
  2023-04-04 16:04         ` Christoph Hellwig
  2023-04-04 18:23         ` Boris Burkov
  0 siblings, 2 replies; 27+ messages in thread

From: Linux regression tracking (Thorsten Leemhuis) @ 2023-04-04 10:49 UTC (permalink / raw)
To: Sergei Trofimovich, Christoph Hellwig
Cc: Josef Bacik, Christopher Price, anand.jain, boris, clm, dsterba, linux-btrfs, regressions

On 23.03.23 23:26, Sergei Trofimovich wrote:
> On Wed, 22 Mar 2023 01:38:42 -0700
> Christoph Hellwig <hch@infradead.org> wrote:
>
>> On Tue, Mar 21, 2023 at 05:26:49PM -0400, Josef Bacik wrote:
>>> We got the defaults based on our testing with our workloads inside of
>>> FB.  Clearly this isn't representative of normal desktop usage, but
>>> we've also got a lot of workloads, so we figured if it made the whole
>>> fleet happy it would probably be fine everywhere.
>>>
>>> That being said, this is tunable for a reason; your workload seems to
>>> generate a lot of freed extents and discards.  We can probably adjust
>>> the async code to pause discarding if there's no other activity
>>> happening on the device at the moment, but tuning it to let more
>>> discards through at a time is also completely valid.  Thanks,

BTW, there is another report about this issue here:
https://bugzilla.redhat.com/show_bug.cgi?id=2182228
/me wonders if there is an openSUSE report as well, but a quick search
didn't find one

And as a fun fact, or for completeness: the issue even made it to
reddit, too:
https://www.reddit.com/r/archlinux/comments/121htxn/btrfs_discard_storm_on_62x_kernel/

>> FYI, discard performance differs a lot between different SSDs.
>> It used to be pretty horrible for most devices early on, and then a
>> certain hyperscaler started requiring decent performance for enterprise
>> drives, so many of them are good now.  A lot less so for the typical
>> consumer drive, especially at the lower end of the spectrum.
>>
>> And that's just NVMe; the still-shipping SATA SSDs are a different
>> story again, not helped by the fact that we don't even support ranged
>> discards for them in Linux.

Thx for your comments Christoph.  Quick question, just to be sure I
understand things properly:

I assume on some of those problematic devices these discard storms will
lead to a performance regression?

I also heard people saying these discard storms might reduce the
lifetime of some devices - is that true?

If the answer to at least one of these is "yes", I'd say it might be
best to revert 63a7cb130718 for now.

> Josef, what did you use as a signal to decide what value was good
> enough?  Did you crank up the number until the discard backlog cleared
> in a reasonable time?
>
> I still don't understand what I should take into account to change the
> default, or whether I should change it at all.  Is it fine if it takes a
> week to get through the discard backlog?  (Or a day?  An hour?  A minute?)
>
> Is it fine to send discards as fast as the device allows instead of doing
> 10 IOPS?  Does the IOPS limit account for a device-wear tradeoff?  Then a
> low IOPS limit makes sense.  Or is the IOPS limit just a way to reserve
> most of the bandwidth for non-discard workloads?  Then I would say
> unlimited IOPS would make more sense as a default for btrfs.

/me would be interested in answers to these questions as well

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
  2023-04-04 10:49       ` Linux regression tracking (Thorsten Leemhuis)
@ 2023-04-04 16:04         ` Christoph Hellwig
  2023-04-04 16:20           ` Roman Mamedov
  2023-04-04 18:15           ` Chris Mason
  0 siblings, 2 replies; 27+ messages in thread

From: Christoph Hellwig @ 2023-04-04 16:04 UTC (permalink / raw)
To: Linux regressions mailing list
Cc: Sergei Trofimovich, Christoph Hellwig, Josef Bacik, Christopher Price, anand.jain, boris, clm, dsterba, linux-btrfs

On Tue, Apr 04, 2023 at 12:49:40PM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> >> And that's just NVMe; the still-shipping SATA SSDs are a different
> >> story again, not helped by the fact that we don't even support ranged
> >> discards for them in Linux.
>
> Thx for your comments Christoph.  Quick question, just to be sure I
> understand things properly:
>
> I assume on some of those problematic devices these discard storms will
> lead to a performance regression?

Probably.

> I also heard people saying these discard storms might reduce the
> lifetime of some devices - is that true?

Also very much possible.  There are various SSDs that treat a discard
as a write zeroes and always return zeroes from all discarded blocks.
If the discards are smaller than or not aligned to the internal erase
(super)blocks, this will actually cause additional writes.

> If the answer to at least one of these is "yes", I'd say it might be
> best to revert 63a7cb130718 for now.

I don't think enabling it is a very smart idea for most consumer
devices.
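[Editor's note: Christoph's point about discards being smaller than, or misaligned to, the internal erase blocks can be sanity-checked against what a device advertises to the kernel. A quick look, with device names as examples (note the advertised granularity is what the driver reports, which does not necessarily match the flash's real erase-block size):

```shell
# Per-device discard properties: DISC-GRAN is the advertised discard
# granularity, DISC-MAX the largest single discard the device accepts.
lsblk --discard

# The same information via sysfs (nvme0n1 is an example device name).
cat /sys/block/nvme0n1/queue/discard_granularity
cat /sys/block/nvme0n1/queue/discard_max_bytes
```

A granularity of 0 means the device does not support discard at all.]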
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
  2023-04-04 16:04         ` Christoph Hellwig
@ 2023-04-04 16:20           ` Roman Mamedov
  2023-04-04 16:27             ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread

From: Roman Mamedov @ 2023-04-04 16:20 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Linux regressions mailing list, Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, boris, clm, dsterba, linux-btrfs

On Tue, 4 Apr 2023 09:04:03 -0700
Christoph Hellwig <hch@infradead.org> wrote:

> > I also heard people saying these discard storms might reduce the
> > lifetime of some devices - is that true?
>
> Also very much possible.  There are various SSDs that treat a discard
> as a write zeroes and always return zeroes from all discarded blocks.
> If the discards are smaller than or not aligned to the internal erase
> (super)blocks, this will actually cause additional writes.

SSDs do not physically erase blocks on discard; that would be very slow.

Instead they nuke the corresponding records in the Flash Translation Layer
(FTL) tables, so that the discarded areas point "nowhere" instead of at the
actual stored blocks.  When facing such pointers while resolving read
requests, the controller knows to just return zeroes.

Of course there can be varying behaviors per SSD; e.g. I know of some that
return random garbage instead of zeroes, and some which for a puzzling
reason prefer to return the byte FF instead.  But I think the first point
above should be universal; I am pretty certain there are none where a
discard/TRIM would take time comparable to "dd if=/dev/zero of=/dev/ssd"
(which would make it unusable in practice).

--
With respect,
Roman
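[Editor's note: whether a drive promises zeroes, a stable pattern, or nothing in particular after TRIM is actually advertised in the ATA identify data. For a SATA device it can be inspected like this; the device name is an example and the exact wording of hdparm's output varies between versions:

```shell
# "Deterministic read ZEROs after TRIM" (RZAT) means reads of trimmed
# ranges return zeroes; "Deterministic read data after TRIM" (DRAT)
# only promises a stable value, not necessarily zeroes.  Needs root.
sudo hdparm -I /dev/sda | grep -i trim
```

NVMe exposes the equivalent via the DLFEAT field of the namespace identify data.]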
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
  2023-04-04 16:20           ` Roman Mamedov
@ 2023-04-04 16:27             ` Christoph Hellwig
  2023-04-04 23:37               ` Damien Le Moal
  0 siblings, 1 reply; 27+ messages in thread

From: Christoph Hellwig @ 2023-04-04 16:27 UTC (permalink / raw)
To: Roman Mamedov
Cc: Christoph Hellwig, Linux regressions mailing list, Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, boris, clm, dsterba, linux-btrfs

On Tue, Apr 04, 2023 at 09:20:27PM +0500, Roman Mamedov wrote:
> SSDs do not physically erase blocks on discard; that would be very slow.
>
> Instead they nuke the corresponding records in the Flash Translation Layer
> (FTL) tables, so that the discarded areas point "nowhere" instead of at the
> actual stored blocks.  When facing such pointers while resolving read
> requests, the controller knows to just return zeroes.

Of course they don't erase blocks on every discard (although if you look
long enough you'll probably find a worst-case implementation that does
this anyway).  But you still need to persist your FTL changes, and the
zeroing, if any, must be done by the time you get a FLUSH command,
because without that you'd return different data when reading a block
after a powerfail (i.e. the old data) than before (zeros or a pattern),
which is a no-go.

> Of course there can be varying behaviors per SSD; e.g. I know of some that
> return random garbage instead of zeroes, and some which for a puzzling
> reason prefer to return the byte FF instead.

All of that is valid behavior per the relevant standards.

> But I think the first point above should be universal; I am pretty
> certain there are none where a discard/TRIM would take time comparable
> to "dd if=/dev/zero of=/dev/ssd" (which would make it unusable in
> practice).

This is wishful thinking :)  SSDs generally optimize the fast path very
heavily, so slow-path commands, even when they should in theory be faster
due to the underlying optimizations, might not be, as they are processed
in software instead of hardware offloads, moved to slower cores, etc.

For discard, things have gotten a lot better in the last few years, but
for many older devices performance can be outright horrible.

For SATA SSDs, the fact that classic TRIM isn't a queued command adds
insult to injury, as it always means draining the queue first and not
issuing any I/O until the TRIM command is done.  There is an FPDMA-queued
version now, but I don't think it was all that widely implemented before
SATA SSDs fell out of favour.
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
  2023-04-04 16:27             ` Christoph Hellwig
@ 2023-04-04 23:37               ` Damien Le Moal
  0 siblings, 0 replies; 27+ messages in thread

From: Damien Le Moal @ 2023-04-04 23:37 UTC (permalink / raw)
To: Christoph Hellwig, Roman Mamedov
Cc: Linux regressions mailing list, Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, boris, clm, dsterba, linux-btrfs

On 4/5/23 01:27, Christoph Hellwig wrote:
> On Tue, Apr 04, 2023 at 09:20:27PM +0500, Roman Mamedov wrote:
>> SSDs do not physically erase blocks on discard; that would be very slow.
>>
>> Instead they nuke the corresponding records in the Flash Translation Layer
>> (FTL) tables, so that the discarded areas point "nowhere" instead of at the
>> actual stored blocks.  When facing such pointers while resolving read
>> requests, the controller knows to just return zeroes.
>
> Of course they don't erase blocks on every discard (although if you look
> long enough you'll probably find a worst-case implementation that does
> this anyway).  But you still need to persist your FTL changes, and the
> zeroing, if any, must be done by the time you get a FLUSH command,
> because without that you'd return different data when reading a block
> after a powerfail (i.e. the old data) than before (zeros or a pattern),
> which is a no-go.
>
>> Of course there can be varying behaviors per SSD; e.g. I know of some that
>> return random garbage instead of zeroes, and some which for a puzzling
>> reason prefer to return the byte FF instead.
>
> All of that is valid behavior per the relevant standards.
>
>> But I think the first point above should be universal; I am pretty
>> certain there are none where a discard/TRIM would take time comparable
>> to "dd if=/dev/zero of=/dev/ssd" (which would make it unusable in
>> practice).
>
> This is wishful thinking :)  SSDs generally optimize the fast path very
> heavily, so slow-path commands, even when they should in theory be faster
> due to the underlying optimizations, might not be, as they are processed
> in software instead of hardware offloads, moved to slower cores, etc.
>
> For discard, things have gotten a lot better in the last few years, but
> for many older devices performance can be outright horrible.
>
> For SATA SSDs, the fact that classic TRIM isn't a queued command adds
> insult to injury, as it always means draining the queue first and not
> issuing any I/O until the TRIM command is done.  There is an FPDMA-queued
> version now, but I don't think it was all that widely implemented before
> SATA SSDs fell out of favour.

Not to mention that many of the SATA SSDs actually implementing queued
trim are buggy.  See the ATA horkage flag ATA_HORKAGE_NO_NCQ_TRIM and
the many consumer SSDs that need that flag.
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
  2023-04-04 16:04         ` Christoph Hellwig
  2023-04-04 16:20           ` Roman Mamedov
@ 2023-04-04 18:15           ` Chris Mason
  2023-04-04 18:51             ` Boris Burkov
  ` (2 more replies)
  1 sibling, 3 replies; 27+ messages in thread

From: Chris Mason @ 2023-04-04 18:15 UTC (permalink / raw)
To: Christoph Hellwig, Linux regressions mailing list
Cc: Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, boris, clm, dsterba, linux-btrfs

On 4/4/23 12:04 PM, Christoph Hellwig wrote:
> On Tue, Apr 04, 2023 at 12:49:40PM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
>>>> And that's just NVMe; the still-shipping SATA SSDs are a different
>>>> story again, not helped by the fact that we don't even support ranged
>>>> discards for them in Linux.
>>
>> Thx for your comments Christoph.  Quick question, just to be sure I
>> understand things properly:
>>
>> I assume on some of those problematic devices these discard storms will
>> lead to a performance regression?

I'm searching through the various threads, but I don't think I've seen
the discard storm quantified?

Boris sent me this:
https://lore.kernel.org/linux-btrfs/ZCxP%2Fll7YjPdb9Ou@infradead.org/T/#m65851e5b8b0caa5320d2b7e322805dd200686f01

Which seems to match the 10 discards per second setting?  We should be
doing more of a dribble than a storm, so I'd like to understand if this
is a separate bug that should be fixed.

>
> Probably.
>
>> I also heard people saying these discard storms might reduce the
>> lifetime of some devices - is that true?
>
> Also very much possible.  There are various SSDs that treat a discard
> as a write zeroes and always return zeroes from all discarded blocks.
> If the discards are smaller than or not aligned to the internal erase
> (super)blocks, this will actually cause additional writes.
>
>> If the answer to at least one of these is "yes", I'd say it might be
>> best to revert 63a7cb130718 for now.
>
> I don't think enabling it is a very smart idea for most consumer
> devices.

It seems like a good time to talk through the variations of discard
usage in FB data centers.  We run a pretty wide variety of hardware,
from consumer-grade SSDs to enterprise SSDs, and we've run these on
ext4/btrfs/xfs.

(Christoph knows most of this already, so I'm only partially replying
to him here.)

First, there was synchronous discard.  These were pretty dark times,
because all three of our filesystems would build a batch of synchronous
discards and then wait for them during filesystem commit.  There were
long tail latencies across all of our workloads, so workload owners
would turn off discard and declare victory over terrible latencies.

Of course this predictably ends up with GC on the drives leading to
terrible latencies because we weren't discarding anymore, and nightly
trims are the obvious answer.  Different workloads would gyrate through
the variations, and the only consistent result was unhappiness.

Some places in the fleet still do this, and it can be a pretty simple
tradeoff between the IO impact of full-drive trims vs the latency
impact of built-up GC vs over-provisioning.  It works for consistent
workloads, but honestly there aren't many of those.

Along the way both btrfs and xfs have grown variations of async discard.
The XFS one (sorry if I'm out of date here) didn't include any kind of
rate limiting, so if you were bulk-deleting a lot of data, XFS would
effectively queue up so many discards that it actually saturated the
device for a long time, starving reads and writes.  If your workload did
a constant stream of allocation and deletion, the async discards would
just saturate the drive forever.

The workloads that care about latencies on XFS ended up going back to
synchronous discards, and they do a slow-rm hack that nibbles away at
the ends of files with periodic fsyncs mixed in until the file is zero
length.  They love this and it makes me cry.

The btrfs async discard feature was meant to address both of these
cases.  The primary features:

- Get rid of the transaction commit latency
- Enable allocations to steal from discards, reducing discard IO
- Avoid saturating the devices with discards by metering them out

Christoph mentions that modern enterprise drives are much better at
discarding, and we see this in production too.  But we still have
workloads that switched from XFS to btrfs because the async discard
feature did a better job of reducing drive write-amp and latencies.

So, honestly, from my POV async discard is best suited to consumer
devices.  Our defaults are probably wrong, because no matter what you
choose there's a drive out there that makes it look bad.  Also, laptops
probably don't want the slow dribble.

I know Boris has some ideas on how to make the defaults better, so I'll
let him chime in there.

-chris
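[Editor's note: on Chris's question of quantifying the storm: both the queued discard work and the rate at which discards actually hit the device can be observed from userspace. A sketch; the mount point and device name are examples, and the sysfs file names are from the btrfs per-filesystem discard directory:

```shell
UUID=$(findmnt -no UUID /)

# How much work is currently queued for async discard.
cat "/sys/fs/btrfs/$UUID/discard/discardable_extents"
cat "/sys/fs/btrfs/$UUID/discard/discardable_bytes"

# Watch discards reaching the device: with a recent sysstat, iostat's
# extended output includes d/s (discard requests per second).
iostat -dx 1 /dev/nvme0n1
```

At the default 10 IOPS limit one would expect d/s to hover around 10 while the backlog above drains, which matches the "dribble, not storm" behavior Chris describes.]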
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
  2023-04-04 18:15           ` Chris Mason
@ 2023-04-04 18:51             ` Boris Burkov
  2023-04-04 19:22               ` David Sterba
  2023-04-10  2:03               ` Michael Bromilow
  0 siblings, 2 replies; 27+ messages in thread

From: Boris Burkov @ 2023-04-04 18:51 UTC (permalink / raw)
To: Chris Mason
Cc: Christoph Hellwig, Linux regressions mailing list, Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, clm, dsterba, linux-btrfs

On Tue, Apr 04, 2023 at 02:15:38PM -0400, Chris Mason wrote:
> On 4/4/23 12:04 PM, Christoph Hellwig wrote:
> > I don't think enabling it is a very smart idea for most consumer
> > devices.
>
> It seems like a good time to talk through the variations of discard
> usage in FB data centers.  We run a pretty wide variety of hardware,
> from consumer-grade SSDs to enterprise SSDs, and we've run these on
> ext4/btrfs/xfs.
>
> First, there was synchronous discard.  These were pretty dark times,
> because all three of our filesystems would build a batch of synchronous
> discards and then wait for them during filesystem commit.  There were
> long tail latencies across all of our workloads, so workload owners
> would turn off discard and declare victory over terrible latencies.
>
> Of course this predictably ends up with GC on the drives leading to
> terrible latencies because we weren't discarding anymore, and nightly
> trims are the obvious answer.  Different workloads would gyrate through
> the variations, and the only consistent result was unhappiness.
>
> Some places in the fleet still do this, and it can be a pretty simple
> tradeoff between the IO impact of full-drive trims vs the latency
> impact of built-up GC vs over-provisioning.  It works for consistent
> workloads, but honestly there aren't many of those.
>
> Along the way both btrfs and xfs have grown variations of async discard.
> The XFS one (sorry if I'm out of date here) didn't include any kind of
> rate limiting, so if you were bulk-deleting a lot of data, XFS would
> effectively queue up so many discards that it actually saturated the
> device for a long time, starving reads and writes.  If your workload did
> a constant stream of allocation and deletion, the async discards would
> just saturate the drive forever.
>
> The workloads that care about latencies on XFS ended up going back to
> synchronous discards, and they do a slow-rm hack that nibbles away at
> the ends of files with periodic fsyncs mixed in until the file is zero
> length.  They love this and it makes me cry.
>
> The btrfs async discard feature was meant to address both of these
> cases.  The primary features:
>
> - Get rid of the transaction commit latency
> - Enable allocations to steal from discards, reducing discard IO
> - Avoid saturating the devices with discards by metering them out
>
> Christoph mentions that modern enterprise drives are much better at
> discarding, and we see this in production too.  But we still have
> workloads that switched from XFS to btrfs because the async discard
> feature did a better job of reducing drive write-amp and latencies.
>
> So, honestly, from my POV async discard is best suited to consumer
> devices.  Our defaults are probably wrong, because no matter what you
> choose there's a drive out there that makes it look bad.  Also, laptops
> probably don't want the slow dribble.
>
> I know Boris has some ideas on how to make the defaults better, so I'll
> let him chime in there.
>
> -chris

Our reasonable options, as I see them:
- Go back to nodiscard and rely on periodic trims from the OS.
- Leave the low iops_limit, accept that drives stay busy unexpectedly
  long, conclude that that's OK, and communicate the tuning/measurement
  options better.
- Set a high iops_limit (e.g. 1000) so drives get back to idle faster.
- Change an unset iops_limit to mean truly unlimited async discard, set
  that as the default, and let anyone who cares to meter it set an
  iops_limit.

The regression here is in drive idle time, due to a modest amount of
discard getting metered out over minutes rather than dealt with
relatively quickly.  So I would favor the unlimited async discard mode,
and will send a patch to that effect which we can discuss.

IMO, the periodic discard cron screwing up your box once a week or once
a day or whatever is a pretty bad user experience as well, as is
randomly hitting bad latencies because you haven't been discarding
often enough.

Boris
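[Editor's note: the "periodic trims from the OS" option Boris mentions is typically the systemd fstrim timer (or an equivalent cron job). For reference, a sketch of how it is enabled and exercised on a systemd distro; the mount point is an example:

```shell
# Trim all mounted filesystems that support discard, once a week by default.
sudo systemctl enable --now fstrim.timer

# One-shot, verbose trim of a single filesystem, useful for seeing how
# much discardable space has accumulated.
sudo fstrim -v /
```

This is the batch model whose once-a-week latency spike Boris contrasts with continuous async discard.]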
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
  2023-04-04 18:51             ` Boris Burkov
@ 2023-04-04 19:22               ` David Sterba
  2023-04-04 19:39                 ` Boris Burkov
  1 sibling, 1 reply; 27+ messages in thread

From: David Sterba @ 2023-04-04 19:22 UTC (permalink / raw)
To: Boris Burkov
Cc: Chris Mason, Christoph Hellwig, Linux regressions mailing list, Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, clm, dsterba, linux-btrfs

On Tue, Apr 04, 2023 at 11:51:51AM -0700, Boris Burkov wrote:
> Our reasonable options, as I see them:
> - Go back to nodiscard and rely on periodic trims from the OS.

We had that before and it's a fallback in case we can't fix it, but the
problem would still persist for anyone enabling async discard, so it's
only limiting the impact.

> - Leave the low iops_limit, accept that drives stay busy unexpectedly
>   long, conclude that that's OK, and communicate the tuning/measurement
>   options better.

This does not sound user friendly; tuning should be possible but not
required by default.  We already have enough other things that users need
to decide, and in this case I don't know if there's enough information to
even make a good decision upfront.

> - Set a high iops_limit (e.g. 1000) so drives get back to idle faster.
> - Change an unset iops_limit to mean truly unlimited async discard, set
>   that as the default, and let anyone who cares to meter it set an
>   iops_limit.
>
> The regression here is in drive idle time, due to a modest amount of
> discard getting metered out over minutes rather than dealt with
> relatively quickly.  So I would favor the unlimited async discard mode,
> and will send a patch to that effect which we can discuss.

Can we do options 3 and 4, i.e. set a high iops limit so that the batch
gets processed faster, and (4) have the manual override to drop the
limit completely?
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-04-04 19:22 ` David Sterba @ 2023-04-04 19:39 ` Boris Burkov 2023-04-05 8:17 ` Linux regression tracking (Thorsten Leemhuis) 0 siblings, 1 reply; 27+ messages in thread From: Boris Burkov @ 2023-04-04 19:39 UTC (permalink / raw) To: David Sterba Cc: Chris Mason, Christoph Hellwig, Linux regressions mailing list, Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, clm, dsterba, linux-btrfs On Tue, Apr 04, 2023 at 09:22:05PM +0200, David Sterba wrote: > On Tue, Apr 04, 2023 at 11:51:51AM -0700, Boris Burkov wrote: > > Our reasonable options, as I see them: > > - back to nodiscard, rely on periodic trims from the OS. > > We had that before and it's a fallback in case we can't fix it but still > the problem would persist for anyone enabling async discard so it's only > limiting the impact. > > > - leave low iops_limit, drives stay busy unexpectedly long, conclude that > > that's OK, and communicate the tuning/measurement options better. > > This does not sound user friendly, tuning should be possible but not > required by default. We already have enough other things that users need > to decide and in this case I don't know if there's enough information to > even make a good decision upfront. > > > - set a high iops_limit (e.g. 1000) drives will get back to idle faster. > > - change an unset iops_limit to mean truly unlimited async discard, set > > that as the default, and anyone who cares to meter it can set an > > iops_limit. > > > > The regression here is in drive idle time due to modest discard getting > > metered out over minutes rather than dealt with relatively quickly. So > > I would favor the unlimited async discard mode and will send a patch to > > that effect which we can discuss. > > Can we do options 3 and 4, i.e. set a high iops so that the batch gets > processed faster and (4) that there's the manual override to drop the > limit completely? 
Yup, I'm on it. Chris also had a good idea for a "time since commit" check to further limit how long async discard would crank for. This would look something like "go as fast as I'm allowed for 5s after every transaction". It could lead to a big buildup that requires a periodic trim to clear. I'm playing with adding that as an additional tunable which would apply if set. Boris ^ permalink raw reply [flat|nested] 27+ messages in thread
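Boris's "time since commit" idea — run discard at full speed for a short window after each transaction commit, then fall back to the metered rate — could be sketched roughly as below. This is a hypothetical model of a tunable that does not exist yet; the class and method names are invented:

```python
import time

class CommitWindowGate:
    """Allow unmetered discard for `window_s` seconds after each
    transaction commit, then fall back to the metered rate."""

    def __init__(self, window_s=5.0):
        self.window_s = window_s
        self.last_commit = float("-inf")  # no commit seen yet

    def on_commit(self):
        # Called when a transaction commits; opens the full-speed window.
        self.last_commit = time.monotonic()

    def full_speed_allowed(self):
        # True while we are inside the post-commit window.
        return time.monotonic() - self.last_commit < self.window_s
```

As Boris notes, metering everything outside such a window could let a backlog build up that only a periodic trim would clear, which is why it would be an additional tunable rather than a replacement.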
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-04-04 19:39 ` Boris Burkov @ 2023-04-05 8:17 ` Linux regression tracking (Thorsten Leemhuis) 0 siblings, 0 replies; 27+ messages in thread From: Linux regression tracking (Thorsten Leemhuis) @ 2023-04-05 8:17 UTC (permalink / raw) To: Boris Burkov, David Sterba Cc: Chris Mason, Christoph Hellwig, Linux regressions mailing list, Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, clm, dsterba, linux-btrfs On 04.04.23 21:39, Boris Burkov wrote: > On Tue, Apr 04, 2023 at 09:22:05PM +0200, David Sterba wrote: >> On Tue, Apr 04, 2023 at 11:51:51AM -0700, Boris Burkov wrote: >>> Our reasonable options, as I see them: >>> - back to nodiscard, rely on periodic trims from the OS. >> >> We had that before and it's a fallback in case we can't fix it but still >> the problem would persist for anyone enabling async discard so it's only >> limiting the impact. >> >>> - leave low iops_limit, drives stay busy unexpectedly long, conclude that >>> that's OK, and communicate the tuning/measurement options better. >> >> This does not sound user friendly, tuning should be possible but not >> required by default. We already have enough other things that users need >> to decide and in this case I don't know if there's enough information to >> even make a good decision upfront. >> >>> - set a high iops_limit (e.g. 1000) drives will get back to idle faster. >>> - change an unset iops_limit to mean truly unlimited async discard, set >>> that as the default, and anyone who cares to meter it can set an >>> iops_limit. >>> >>> The regression here is in drive idle time due to modest discard getting >>> metered out over minutes rather than dealt with relatively quickly. So >>> I would favor the unlimited async discard mode and will send a patch to >>> that effect which we can discuss. >> >> Can we do options 3 and 4, i.e. 
set a high iops so that the batch gets >> processed faster and (4) that there's the manual override to drop the >> limit completely? > > Yup, I'm on it. > [...] Thx everyone for looking into this and the fruitful discussion yesterday, much appreciated. /me will now get back to watching this only from afar and leave things to the experts Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) -- Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr If I did something stupid, please tell me, as explained on that page. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-04-04 18:51 ` Boris Burkov 2023-04-04 19:22 ` David Sterba @ 2023-04-10 2:03 ` Michael Bromilow 2023-04-11 17:52 ` David Sterba 1 sibling, 1 reply; 27+ messages in thread From: Michael Bromilow @ 2023-04-10 2:03 UTC (permalink / raw) To: Boris Burkov, Chris Mason Cc: Christoph Hellwig, Linux regressions mailing list, Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, clm, dsterba, linux-btrfs On 04/04/2023 19:51, Boris Burkov wrote: > On Tue, Apr 04, 2023 at 02:15:38PM -0400, Chris Mason wrote: >> So, honestly from my POV the async discard is best suited to consumer >> devices. Our defaults are probably wrong because no matter what you >> choose there's a drive out there that makes it look bad. Also, laptops >> probably don't want the slow dribble. > > Our reasonable options, as I see them: > - back to nodiscard, rely on periodic trims from the OS. > - leave low iops_limit, drives stay busy unexpectedly long, conclude that > that's OK, and communicate the tuning/measurement options better. > - set a high iops_limit (e.g. 1000) drives will get back to idle faster. > - change an unset iops_limit to mean truly unlimited async discard, set > that as the default, and anyone who cares to meter it can set an > iops_limit. > > The regression here is in drive idle time due to modest discard getting > metered out over minutes rather than dealt with relatively quickly. So > I would favor the unlimited async discard mode and will send a patch to > that effect which we can discuss. Will power usage be taken into consideration? I only noticed this regression myself when my laptop started to draw a constant extra ~10W from the constant drive activity, so I imagine other laptop users would also prefer a default which avoids this if possible. 
If "relatively quickly" means anything more than a few seconds I could see that causing rather significant battery life reductions. - Michael ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-04-10 2:03 ` Michael Bromilow @ 2023-04-11 17:52 ` David Sterba 2023-04-11 18:15 ` Linux regression tracking (Thorsten Leemhuis) 0 siblings, 1 reply; 27+ messages in thread From: David Sterba @ 2023-04-11 17:52 UTC (permalink / raw) To: Michael Bromilow Cc: Boris Burkov, Chris Mason, Christoph Hellwig, Linux regressions mailing list, Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, clm, dsterba, linux-btrfs On Mon, Apr 10, 2023 at 03:03:37AM +0100, Michael Bromilow wrote: > On 04/04/2023 19:51, Boris Burkov wrote: > > On Tue, Apr 04, 2023 at 02:15:38PM -0400, Chris Mason wrote: > >> So, honestly from my POV the async discard is best suited to consumer > >> devices. Our defaults are probably wrong because no matter what you > >> choose there's a drive out there that makes it look bad. Also, laptops > >> probably don't want the slow dribble. > > > > Our reasonable options, as I see them: > > - back to nodiscard, rely on periodic trims from the OS. > > - leave low iops_limit, drives stay busy unexpectedly long, conclude that > > that's OK, and communicate the tuning/measurement options better. > > - set a high iops_limit (e.g. 1000) drives will get back to idle faster. > > - change an unset iops_limit to mean truly unlimited async discard, set > > that as the default, and anyone who cares to meter it can set an > > iops_limit. > > > > The regression here is in drive idle time due to modest discard getting > > metered out over minutes rather than dealt with relatively quickly. So > > I would favor the unlimited async discard mode and will send a patch to > > that effect which we can discuss. > > Will power usage be taken into consideration? 
I only noticed this regression > myself when my laptop started to draw a constant extra ~10W from the constant > drive activity, so I imagine other laptop users would also prefer a default > which avoids this if possible. If "relatively quickly" means anything more > than a few seconds I could see that causing rather significant battery life > reductions. This may need to be configured from userspace, e.g. from laptop-mode-tools, I'm not sure if any of the power usage info is available from kernel. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-04-11 17:52 ` David Sterba @ 2023-04-11 18:15 ` Linux regression tracking (Thorsten Leemhuis) 0 siblings, 0 replies; 27+ messages in thread From: Linux regression tracking (Thorsten Leemhuis) @ 2023-04-11 18:15 UTC (permalink / raw) To: dsterba, Michael Bromilow Cc: Boris Burkov, Chris Mason, Christoph Hellwig, Linux regressions mailing list, Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, clm, dsterba, linux-btrfs On 11.04.23 19:52, David Sterba wrote: > On Mon, Apr 10, 2023 at 03:03:37AM +0100, Michael Bromilow wrote: >> On 04/04/2023 19:51, Boris Burkov wrote: >>> On Tue, Apr 04, 2023 at 02:15:38PM -0400, Chris Mason wrote: >>>> So, honestly from my POV the async discard is best suited to consumer >>>> devices. Our defaults are probably wrong because no matter what you >>>> choose there's a drive out there that makes it look bad. Also, laptops >>>> probably don't want the slow dribble. >>> >>> Our reasonable options, as I see them: >>> - back to nodiscard, rely on periodic trims from the OS. >>> - leave low iops_limit, drives stay busy unexpectedly long, conclude that >>> that's OK, and communicate the tuning/measurement options better. >>> - set a high iops_limit (e.g. 1000) drives will get back to idle faster. >>> - change an unset iops_limit to mean truly unlimited async discard, set >>> that as the default, and anyone who cares to meter it can set an >>> iops_limit. >>> >>> The regression here is in drive idle time due to modest discard getting >>> metered out over minutes rather than dealt with relatively quickly. So >>> I would favor the unlimited async discard mode and will send a patch to >>> that effect which we can discuss. >> >> Will power usage be taken into consideration? 
I only noticed this regression >> myself when my laptop started to draw a constant extra ~10W from the constant >> drive activity, so I imagine other laptop users would also prefer a default >> which avoids this if possible. If "relatively quickly" means anything more >> than a few seconds I could see that causing rather significant battery life >> reductions. > > This may need to be configured from userspace, e.g. from > laptop-mode-tools, I'm not sure if any of the power usage info is > available from kernel. Well, from the point of the regression tracker this looks a bit different: if a system draws a constant extra ~10W after a kernel update, it's a regression that needs to be fixed. But in the end it's of course Linus' opinion that matters. But whatever, Boris afaics is looking for an improved solution that hopefully avoids the extra power consumption and at the same time does the right thing on modern enterprise storage devices, so that the defaults work for the majority of the users. Hopefully this will work out. :-D Ciao, Thorsten ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-04-04 18:15 ` Chris Mason 2023-04-04 18:51 ` Boris Burkov @ 2023-04-04 19:08 ` Sergei Trofimovich 2023-04-05 6:18 ` Christoph Hellwig 2 siblings, 0 replies; 27+ messages in thread From: Sergei Trofimovich @ 2023-04-04 19:08 UTC (permalink / raw) To: Chris Mason Cc: Christoph Hellwig, Linux regressions mailing list, Josef Bacik, Christopher Price, anand.jain, boris, clm, dsterba, linux-btrfs On Tue, 4 Apr 2023 14:15:38 -0400 Chris Mason <clm@meta.com> wrote: > On 4/4/23 12:04 PM, Christoph Hellwig wrote: > > On Tue, Apr 04, 2023 at 12:49:40PM +0200, Linux regression tracking (Thorsten Leemhuis) wrote: > >>>> And that jut NVMe, the still shipping SATA SSDs are another different > >>>> story. Not helped by the fact that we don't even support ranged > >>>> discards for them in Linux. > >> > >> Thx for your comments Christoph. Quick question, just to be sure I > >> understand things properly: > >> > >> I assume on some of those problematic devices these discard storms will > >> lead to a performance regression? > > I'm searching through the various threads, but I don't think I've seen > the discard storm quantified? > > Boris sent me this: > https://lore.kernel.org/linux-btrfs/ZCxP%2Fll7YjPdb9Ou@infradead.org/T/#m65851e5b8b0caa5320d2b7e322805dd200686f01 > > Which seems to match the 10 discards per second setting? We should be > doing more of a dribble than a storm, so I'd like to understand if this > is a separate bug that should be fixed. I called it a storm based on the following LED behaviour: https://trofi.github.io/posts.data/281-a-signal-from-the-stars/01-shimmer.gif It's not a saturated device (as you can see the flashes on and off). But it's not idle either. "storm" was an exaggeration. Note that it takes linux 3-4 days to go through a backlog of 300GB of deleted files. Is 3-4 days expected? Or my setup is somehow unusual? 
Maybe `compress=zstd` makes things disproportionally problematic? > > Probably. > > > >> I also heard people saying these discard storms might reduce the life > >> time of some devices - is that true? > > > > Also very much possible. There are various SSDs that treat a discard > > as a write zeroes and always return zeroes from all discarded blocks. > > If the discards are smaller than or not aligned to the internal erase > > (super)blocks, this will actually cause additional writes. > > > >> If the answer to at least one of these is "yes" I'd say we it might be > >> best to revert 63a7cb130718 for now. > > > > I don't think enabling it is a very a smart idea for most consumer > > devices. > > It seems like a good time to talk through a variations of discard usage > in fb data centers. We run a pretty wide variety of hardware from > consumer grade ssds to enterprise ssds, and we've run these on > ext4/btrfs/xfs. > > (Christoph knows most of this already, so I'm only partially replying to > him here) > > First, there was synchronous discard. These were pretty dark times > because all three of our filesystems would build a batch of synchronous > discards and then wait for them during filesystem commit. There were > long tail latencies across all of our workloads, and so workload owners > would turn off discard and declare victory over terrible latencies. > > Of course this predictably ends up with GC on the drives leading to > terrible latencies because we weren't discarding anymore, and nightly > trims are the obvious answer. Different workloads would gyrate through > the variations and the only consistent result was unhappiness. > > Some places in the fleet still do this, and it can be a pretty simple > tradeoff between the IO impacts of full drive trims vs the latency > impact of built up GC vs over-provisioning. It works for consistent > workloads, but honestly there aren't many of those. 
> > Along the way both btrfs and xfs have grown variations of async discard. > The XFS one (sorry if I'm out of date here), didn't include any kind of > rate limiting, so if you were bulk deleting a lot of data, XFS would > effectively queue up so many discards that it actually saturated the > device for a long time, starving reads and writes. If your workload did > a constant stream of allocation and deletion, the async discards would > just saturate the drive forever. > > The workloads that care about latencies on XFS ended up going back to > synchronous discards, and they do a slow-rm hack that nibbles away at > the ends of files with periodic fsyncs mixed in until the file is zero > length. They love this and it makes me cry. > > The btrfs async discard feature was meant to address both of these > cases. The primary features: > > - Get rid of the transaction commit latency > - Enable allocations to steal from discards, reducing discard IO > - Avoid saturating the devices with discards by metering them out > > Christoph mentions that modern enterprise drives are much better at > discarding, and we see this in production too. But, we still have > workloads that switched from XFS to Btrfs because the async discard > feature did a better job of reducing drive write-amp and latencies. > > So, honestly from my POV the async discard is best suited to consumer > devices. Our defaults are probably wrong because no matter what you > choose there's a drive out there that makes it look bad. Also, laptops > probably don't want the slow dribble. > > I know Boris has some ideas on how to make the defaults better, so I'll > let him chime in there. That makes sense. Thank you! -- Sergei ^ permalink raw reply [flat|nested] 27+ messages in thread
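Sergei's 3-4 day figure above is roughly what the default metering would predict under one hypothetical assumption not stated in the thread: that the 300GB of deleted compressed data was fragmented into extents near 128 KiB (btrfs's maximum compressed extent size), and that each freed extent costs one discard at the default rate of ~10 discards per second:

```python
def drain_days(backlog_bytes, extent_bytes, discards_per_sec):
    """Days to drain a discard backlog at a fixed discard rate,
    assuming one discard per freed extent (a simplification)."""
    extents = backlog_bytes / extent_bytes
    return extents / discards_per_sec / 86_400  # seconds per day

# Hypothetical numbers: 300 GB backlog, 128 KiB compressed extents,
# default metering of ~10 discards per second.
days = drain_days(300e9, 128 * 1024, 10)  # ~2.6 days
```

Under those assumptions the estimate lands in the same ballpark as the observed 3-4 days, which would also explain why `compress=zstd` makes it worse: compression caps extent size and so multiplies the number of extents to discard.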
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-04-04 18:15 ` Chris Mason 2023-04-04 18:51 ` Boris Burkov 2023-04-04 19:08 ` Sergei Trofimovich @ 2023-04-05 6:18 ` Christoph Hellwig 2023-04-05 12:01 ` Chris Mason 2 siblings, 1 reply; 27+ messages in thread From: Christoph Hellwig @ 2023-04-05 6:18 UTC (permalink / raw) To: Chris Mason Cc: Christoph Hellwig, Linux regressions mailing list, Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, boris, clm, dsterba, linux-btrfs On Tue, Apr 04, 2023 at 02:15:38PM -0400, Chris Mason wrote: > It seems like a good time to talk through a variations of discard usage > in fb data centers. We run a pretty wide variety of hardware from > consumer grade ssds to enterprise ssds, and we've run these on > ext4/btrfs/xfs. Remember that what you guys call consumer grade is top-of-the-shelf consumer grade SSDs, or in fact low-end enterprise ones that are based on the latest and greatest top-shelf SSDs with firmware specifically tweaked for you. That is a whole different level from a random crappy no-name SSD from 5 years ago. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-04-05 6:18 ` Christoph Hellwig @ 2023-04-05 12:01 ` Chris Mason 0 siblings, 0 replies; 27+ messages in thread From: Chris Mason @ 2023-04-05 12:01 UTC (permalink / raw) To: Christoph Hellwig Cc: Linux regressions mailing list, Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, boris, clm, dsterba, linux-btrfs On 4/5/23 2:18 AM, Christoph Hellwig wrote: > On Tue, Apr 04, 2023 at 02:15:38PM -0400, Chris Mason wrote: >> It seems like a good time to talk through a variations of discard usage >> in fb data centers. We run a pretty wide variety of hardware from >> consumer grade ssds to enterprise ssds, and we've run these on >> ext4/btrfs/xfs. > > Remember what you guys call consumer grade is the top of the shelve > consumer grade SSDs, or in fact low-end enterprise ones that are based > on the latest and greatest top shelve SSDs with firmware specifically > tweaked for you. That is a whole differnet level from random crappy > no-name SSD from 5 years ago. > This is definitely true, our providers are reliable and support the products well. But, you also have to factor in the write cycles, drive lifetimes, and just the constant threat of statistics at scale. Drives might come in off something close to the top shelf, but datacenters are pretty hard on storage. Going back to the fedora thread, we've definitely had ssds with reliability issues caused by trims, but we haven't had any I can think of where mount -o discard was fundamentally less reliable than periodic full device trimming. -chris ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-04-04 10:49 ` Linux regression tracking (Thorsten Leemhuis) 2023-04-04 16:04 ` Christoph Hellwig @ 2023-04-04 18:23 ` Boris Burkov 2023-04-04 19:12 ` Sergei Trofimovich 1 sibling, 1 reply; 27+ messages in thread From: Boris Burkov @ 2023-04-04 18:23 UTC (permalink / raw) To: Linux regression tracking (Thorsten Leemhuis) Cc: Sergei Trofimovich, Christoph Hellwig, Josef Bacik, Christopher Price, anand.jain, clm, dsterba, linux-btrfs, regressions On Tue, Apr 04, 2023 at 12:49:40PM +0200, Linux regression tracking (Thorsten Leemhuis) wrote: > On 23.03.23 23:26, Sergei Trofimovich wrote: > > On Wed, 22 Mar 2023 01:38:42 -0700 > > Christoph Hellwig <hch@infradead.org> wrote: > > > >> On Tue, Mar 21, 2023 at 05:26:49PM -0400, Josef Bacik wrote: > >>> We got the defaults based on our testing with our workloads inside of > >>> FB. Clearly this isn't representative of a normal desktop usage, but > >>> we've also got a lot of workloads so figured if it made the whole > >>> fleet happy it would probably be fine everywhere. > >>> > >>> That being said this is tunable for a reason, your workload seems to > >>> generate a lot of free'd extents and discards. We can probably mess > >>> with the async stuff to maybe pause discarding if there's no other > >>> activity happening on the device at the moment, but tuning it to let > >>> more discards through at a time is also completely valid. 
Thanks, > > BTW, there is another report about this issue here: > https://bugzilla.redhat.com/show_bug.cgi?id=2182228 > > /me wonders if there is a opensuse report as well, but a quick search > didn't find one > > And as fun fact or for completeness, the issue even made it to reddit, too: > https://www.reddit.com/r/archlinux/comments/121htxn/btrfs_discard_storm_on_62x_kernel/ Good find, but also: https://www.reddit.com/r/Fedora/comments/vjfpkv/periodic_trim_freezes_ssd/ So without harder data, there is a bit of bias inherent in cherrypicking negative impressions from the internet. > > >> FYI, discard performance differs a lot between different SSDs. > >> It used to be pretty horrible for most devices early on, and then a > >> certain hyperscaler started requiring decent performance for enterprise > >> drives, so many of them are good now. A lot less so for the typical > >> consumer drive, especially at the lower end of the spectrum. > >> > >> And that jut NVMe, the still shipping SATA SSDs are another different > >> story. Not helped by the fact that we don't even support ranged > >> discards for them in Linux. > > Thx for your comments Christoph. Quick question, just to be sure I > understand things properly: > > I assume on some of those problematic devices these discard storms will > lead to a performance regression? > > I also heard people saying these discard storms might reduce the life > time of some devices - is that true? > > If the answer to at least one of these is "yes" I'd say we it might be > best to revert 63a7cb130718 for now. > > > Josef, what did you use as a signal to detect what value was good > > enough? Did you crank up the number until discard backlog clears up in a > > reasonable time? Josef is OOO, so I'll try to clarify some things around async discard, hopefully it's helpful to anyone wondering how to tune it. 
Like you guessed, our tuning basically consists of looking at the discardable_extents/discardable_bytes metric in the fleet and ensuring it looks sane, and that we see an improvement in I/O tail latencies or fix some concrete instances of bad tail latencies. e.g. with iops_limit=10, we see concrete cases of bad latency go away and don't see a steady buildup of discardable_extents. > > > > I still don't understand what I should take into account to change the > > default and whether I should change it at all. Is it fine if the discard > > backlog takes a week to get through it? (Or a day? An hour? A minute?) I believe the relevant metrics are: - number of trims issued/bytes trimmed (you would measure this by tracing and by looking at discard_extent_bytes and discard_bitmap_bytes) - bytes "wrongly" trimmed. (extents that were reallocated without getting trimmed are exposed in discard_bytes_saved, so if that drops, you are maybe trimming things you may not have needed to) - discardable_extents/discardable_bytes (in sysfs; the outstanding work) - tail latency of file system operations - disk idle time By doing periodic trim you trade off better bytes_saved and better disk idle time (one big trim a week vs. "trim all the time") against worse tail latency during the trim itself, and you risk trimming too infrequently, leading to worse latency on a drive that needs a trim. > > > > Is it fine to send discards as fast as device allows instead of doing 10 > > IOPS? Does IOPS limit consider a device wearing tradeoff? Then low IOPS > > makes sense. Or IOPS limit is just a way to reserve most bandwidth to > > non-discard workloads? Then I would say unlimited IOPS as a default > > would make more sense for btrfs. Unfortunately, btrfs currently doesn't have a "fully unlimited" async discard no matter how you tune it. 
The basic premise of btrfs async discard is to meter out the discards at a steady rate, asynchronously from file system operations, so the effect of the tunables is to set that delay between discard operations. The delay is clamped between 1ms and 1000ms, so iops_limit > 1000 is the same as iops_limit = 1000. iops_limit=0 does not lead to unmetered discards, but rather hits a hardcoded case of metering them out over 6 hours. (no clue why, I don't personally love that...) Hope that's somewhat helpful, Boris > > /me would be interested in answers to these questions as well > > Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) > -- > Everything you wanna know about Linux kernel regression tracking: > https://linux-regtracking.leemhuis.info/about/#tldr > If I did something stupid, please tell me, as explained on that page. ^ permalink raw reply [flat|nested] 27+ messages in thread
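The knobs and counters Boris names live in per-filesystem sysfs files; a small helper like the one below can collect them for monitoring. The directory path is an assumption based on the thread's description (`/sys/fs/btrfs/<fsid>/discard`), and the helper simply skips files a given kernel doesn't expose:

```python
import os

# Knobs and counters named in the thread; which ones exist depends on
# the kernel version.
DISCARD_FILES = [
    "iops_limit", "kbps_limit",
    "discardable_extents", "discardable_bytes",
    "discard_extent_bytes", "discard_bitmap_bytes",
    "discard_bytes_saved",
]

def read_discard_stats(base):
    """Collect whichever discard files exist under `base`, e.g. the
    assumed /sys/fs/btrfs/<fsid>/discard directory, as integers."""
    stats = {}
    for name in DISCARD_FILES:
        path = os.path.join(base, name)
        if os.path.isfile(path):
            with open(path) as fh:
                stats[name] = int(fh.read().strip())
    return stats
```

Sampling these periodically (watching for a steady buildup of discardable_extents, or a drop in discard_bytes_saved) mirrors the fleet-level tuning Boris describes.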
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-04-04 18:23 ` Boris Burkov @ 2023-04-04 19:12 ` Sergei Trofimovich 0 siblings, 0 replies; 27+ messages in thread From: Sergei Trofimovich @ 2023-04-04 19:12 UTC (permalink / raw) To: Boris Burkov Cc: Linux regression tracking (Thorsten Leemhuis), Christoph Hellwig, Josef Bacik, Christopher Price, anand.jain, clm, dsterba, linux-btrfs, regressions On Tue, 4 Apr 2023 11:23:12 -0700 Boris Burkov <boris@bur.io> wrote: > On Tue, Apr 04, 2023 at 12:49:40PM +0200, Linux regression tracking (Thorsten Leemhuis) wrote: > > On 23.03.23 23:26, Sergei Trofimovich wrote: > > > On Wed, 22 Mar 2023 01:38:42 -0700 > > > Christoph Hellwig <hch@infradead.org> wrote: > > > > > >> On Tue, Mar 21, 2023 at 05:26:49PM -0400, Josef Bacik wrote: > > >>> We got the defaults based on our testing with our workloads inside of > > >>> FB. Clearly this isn't representative of a normal desktop usage, but > > >>> we've also got a lot of workloads so figured if it made the whole > > >>> fleet happy it would probably be fine everywhere. > > >>> > > >>> That being said this is tunable for a reason, your workload seems to > > >>> generate a lot of free'd extents and discards. We can probably mess > > >>> with the async stuff to maybe pause discarding if there's no other > > >>> activity happening on the device at the moment, but tuning it to let > > >>> more discards through at a time is also completely valid. 
Thanks, > > > > BTW, there is another report about this issue here: > > https://bugzilla.redhat.com/show_bug.cgi?id=2182228 > > > > /me wonders if there is a opensuse report as well, but a quick search > > didn't find one > > > > And as fun fact or for completeness, the issue even made it to reddit, too: > > https://www.reddit.com/r/archlinux/comments/121htxn/btrfs_discard_storm_on_62x_kernel/ > > Good find, but also: > https://www.reddit.com/r/Fedora/comments/vjfpkv/periodic_trim_freezes_ssd/ > So without harder data, there is a bit of bias inherent in cherrypicking > negative impressions from the internet. > > > > > >> FYI, discard performance differs a lot between different SSDs. > > >> It used to be pretty horrible for most devices early on, and then a > > >> certain hyperscaler started requiring decent performance for enterprise > > >> drives, so many of them are good now. A lot less so for the typical > > >> consumer drive, especially at the lower end of the spectrum. > > >> > > >> And that jut NVMe, the still shipping SATA SSDs are another different > > >> story. Not helped by the fact that we don't even support ranged > > >> discards for them in Linux. > > > > Thx for your comments Christoph. Quick question, just to be sure I > > understand things properly: > > > > I assume on some of those problematic devices these discard storms will > > lead to a performance regression? > > > > I also heard people saying these discard storms might reduce the life > > time of some devices - is that true? > > > > If the answer to at least one of these is "yes" I'd say we it might be > > best to revert 63a7cb130718 for now. > > > > > Josef, what did you use as a signal to detect what value was good > > > enough? Did you crank up the number until discard backlog clears up in a > > > reasonable time? > > Josef is OOO, so I'll try to clarify some things around async discard, > hopefully it's helpful to anyone wondering how to tune it. 
> > Like you guessed, our tuning basically consists of looking at the > discardable_extents/discardable_bytes metric in the fleet and ensuring > it looks sane, and that we see an improvement in I/O tail latencies or > fix some concrete instances of bad tail latencies. e.g. with > iops_limit=10, we see concrete cases of bad latency go away and don't > see a steady buildup of discardable_extents. > > > > > > > I still don't understand what I should take into account to change the > > > default and whether I should change it at all. Is it fine if the discard > > > backlog takes a week to get through it? (Or a day? An hour? A minute?) > > I believe the relevant metrics are: > > - number of trims issued/bytes trimmed (you would measure this by tracing > and by looking at discard_extent_bytes and discard_bitmap_bytes) > - bytes "wrongly" trimmed. (extents that were reallocated without getting > trimmed are exposed in discard_bytes_saved, so if that drops, you are > maybe trimming things that you may have not needed to) > - discardable_extents/discardable_bytes (in sysfs; the outstanding work) > - tail latency of file system operations > - disk idle time > > By doing periodic trim you tradeoff better bytes_saved and better disk > idle time (big trim once a week, vs. "trim all the time" against worse > tail latency during the trim itself, and risking trimming too > infrequently, leading to worse latency on a drive that needs a trim. > > > > > > > Is it fine to send discards as fast as device allows instead of doing 10 > > > IOPS? Does IOPS limit consider a device wearing tradeoff? Then low IOPS > > > makes sense. Or IOPS limit is just a way to reserve most bandwidth to > > > non-discard workloads? Then I would say unlimited IOPS as a default > > > would make more sense for btrfs. > > Unfortunately, btrfs currently doesn't have a "fully unlimited" async > discard no matter how you tune it. 
Ignoring kbps_limit, which only > serves to increase the delay, iops_limit has an effective range between > 1 and 1000. The basic premise of btrfs async discard is to meter out > the discards at a steady rate, asynchronously from file system > operations, so the effect of the tunables is to set that delay between > discard operations. The delay is clamped between 1ms and 1000ms, so > iops_limit > 1000 is the same as iops_limit = 1000. iops_limit=0 does > not lead to unmetered discards, but rather hits a hardcoded case of > metering them out over 6 hours. (no clue why, I don't personally love > that...) > > Hope that's somewhat helpful, Thank you, Boris! That is very helpful. `discard_bytes_saved` is a good (and not very obvious to understand!) signal. > Boris > > > > > /me would be interested in answers to these questions as well > > > > Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) > > -- > > Everything you wanna know about Linux kernel regression tracking: > > https://linux-regtracking.leemhuis.info/about/#tldr > > If I did something stupid, please tell me, as explained on that page. -- Sergei ^ permalink raw reply [flat|nested] 27+ messages in thread
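Boris's explanation above pins down the effective semantics of iops_limit: the delay between discard work items is 1000 ms / iops_limit, clamped to [1 ms, 1000 ms]. A small back-of-envelope sketch of that metering (a reader's model of the behaviour described, not the kernel code itself) makes the clamping concrete:

```python
# Model of the async-discard metering described above: the delay
# between discard work items is 1000ms / iops_limit, clamped to
# [1ms, 1000ms]. iops_limit=0 is a special case (backlog metered out
# over ~6 hours) and is deliberately not modelled here.

def discard_delay_ms(iops_limit: int) -> float:
    """Approximate delay between two discards, in milliseconds."""
    assert iops_limit >= 1, "iops_limit=0 is handled specially by btrfs"
    ideal = 1000.0 / iops_limit          # delay needed to hit the target IOPS
    return min(max(ideal, 1.0), 1000.0)  # clamp to [1ms, 1000ms]

# iops_limit=10 (the default discussed in this thread) -> 100ms delay;
# any value above 1000 clamps to the same 1ms delay as 1000.
for limit in (1, 10, 1000, 6000):
    print(f"iops_limit={limit}: delay {discard_delay_ms(limit):.0f} ms")
```

Under this model Christopher's ~6000 setting would behave identically to 1000 (both clamp to a 1 ms delay); if the two measurably differ in practice, the model above is too simple.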
[parent not found: <Y/+n1wS/4XAH7X1p@nz>]
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async [not found] <Y/+n1wS/4XAH7X1p@nz> @ 2023-03-02 8:04 ` Linux regression tracking #adding (Thorsten Leemhuis) 2023-04-04 10:52 ` Linux regression tracking #update (Thorsten Leemhuis) 2023-04-21 13:56 ` Linux regression tracking #update (Thorsten Leemhuis) [not found] ` <94cf49d0-fa2d-cc2c-240e-222706d69eb3@oracle.com> 1 sibling, 2 replies; 27+ messages in thread From: Linux regression tracking #adding (Thorsten Leemhuis) @ 2023-03-02 8:04 UTC (permalink / raw) To: Sergei Trofimovich, linux-btrfs Cc: David Sterba, Boris Burkov, Chris Mason, Josef Bacik, Linux kernel regressions list [CCing the regression list, as it should be in the loop for regressions: https://docs.kernel.org/admin-guide/reporting-regressions.html] [TLDR: I'm adding this report to the list of tracked Linux kernel regressions; the text you find below is based on a few templates paragraphs you might have encountered already in similar form. See link in footer if these mails annoy you.] On 01.03.23 20:30, Sergei Trofimovich wrote: > Hi btrfs maintainers! > > Tl;DR: > > After 63a7cb13071842 "btrfs: auto enable discard=async when possible" I > see constant DISCARD storm towards my NVME device be it idle or not. > > No storm: v6.1 and older > Has storm: v6.2 and newer Thanks for the report. To be sure the issue doesn't fall through the cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression tracking bot: #regzbot ^introduced 63a7cb13071842 #regzbot title btrfs: DISCARD storm towards NVME device be it idle or not #regzbot ignore-activity This isn't a regression? This issue or a fix for it are already discussed somewhere else? It was fixed already? You want to clarify when the regression started to happen? Or point out I got the title or something else totally wrong? 
Then just reply and tell me -- ideally while also telling regzbot about it, as explained by the page listed in the footer of this mail. Developers: When fixing the issue, remember to add 'Link:' tags pointing to the report (the parent of this mail). See page linked in footer for details. Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) -- Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr That page also explains what to do if mails like this annoy you. > More words: > > After upgrade from 6.1 to 6.2 I noticed that Disk led on my desktop > started flashing incessantly regardless of present or absent workload. > > I think I confirmed the storm with `perf`: led flashes align with output > of: > > # perf ftrace -a -T 'nvme_setup*' | cat > > kworker/6:1H-298 [006] 2569.645201: nvme_setup_cmd <-nvme_queue_rq > kworker/6:1H-298 [006] 2569.645205: nvme_setup_discard <-nvme_setup_cmd > kworker/6:1H-298 [006] 2569.749198: nvme_setup_cmd <-nvme_queue_rq > kworker/6:1H-298 [006] 2569.749202: nvme_setup_discard <-nvme_setup_cmd > kworker/6:1H-298 [006] 2569.853204: nvme_setup_cmd <-nvme_queue_rq > kworker/6:1H-298 [006] 2569.853209: nvme_setup_discard <-nvme_setup_cmd > kworker/6:1H-298 [006] 2569.958198: nvme_setup_cmd <-nvme_queue_rq > kworker/6:1H-298 [006] 2569.958202: nvme_setup_discard <-nvme_setup_cmd > > `iotop` shows no read/write IO at all (expected). > > I was able to bisect it down to this commit: > > $ git bisect good > 63a7cb13071842966c1ce931edacbc23573aada5 is the first bad commit > commit 63a7cb13071842966c1ce931edacbc23573aada5 > Author: David Sterba <dsterba@suse.com> > Date: Tue Jul 26 20:54:10 2022 +0200 > > btrfs: auto enable discard=async when possible > > There's a request to automatically enable async discard for capable > devices. 
We can do that, the async mode is designed to wait for larger > freed extents and is not intrusive, with limits to iops, kbps or latency. > > The status and tunables will be exported in /sys/fs/btrfs/FSID/discard . > > The automatic selection is done if there's at least one discard capable > device in the filesystem (not capable devices are skipped). Mounting > with any other discard option will honor that option, notably mounting > with nodiscard will keep it disabled. > > Link: https://lore.kernel.org/linux-btrfs/CAEg-Je_b1YtdsCR0zS5XZ_SbvJgN70ezwvRwLiCZgDGLbeMB=w@mail.gmail.com/ > Reviewed-by: Boris Burkov <boris@bur.io> > Signed-off-by: David Sterba <dsterba@suse.com> > > fs/btrfs/ctree.h | 1 + > fs/btrfs/disk-io.c | 14 ++++++++++++++ > fs/btrfs/super.c | 2 ++ > fs/btrfs/volumes.c | 3 +++ > fs/btrfs/volumes.h | 2 ++ > 5 files changed, 22 insertions(+) > > Is this storm a known issue? I did not dig too much into the patch. But > glancing at it this bit looks slightly off: > > + if (bdev_max_discard_sectors(bdev)) > + fs_devices->discardable = true; > > Is it expected that there is no `= false` assignment? > > This is the list of `btrfs` filesystems I have: > > $ cat /proc/mounts | fgrep btrfs > /dev/nvme0n1p3 / btrfs rw,noatime,compress=zstd:3,ssd,space_cache,subvolid=848,subvol=/nixos 0 0 > /dev/sda3 /mnt/archive btrfs rw,noatime,compress=zstd:3,space_cache,subvolid=5,subvol=/ 0 0 > # skipped bind mounts > > The device is: > > $ lspci | fgrep -i Solid > 01:00.0 Non-Volatile memory controller: ADATA Technology Co., Ltd. XPG SX8200 Pro PCIe Gen3x4 M.2 2280 Solid State Drive (rev 03) > > Can you help me debug the source of discards storm? > > Is it an expected discard storm? > > Is it problematic for SSD life span? > > Thank you! > ^ permalink raw reply [flat|nested] 27+ messages in thread
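The per-event timestamps in a trace like the one in the report are enough to estimate the discard rate directly. A quick sketch (sample lines copied from the report above; a throwaway illustration, not a general ftrace parser):

```python
import re

# Sample `perf ftrace` lines from the report above.
trace = """\
kworker/6:1H-298 [006] 2569.645205: nvme_setup_discard <-nvme_setup_cmd
kworker/6:1H-298 [006] 2569.749202: nvme_setup_discard <-nvme_setup_cmd
kworker/6:1H-298 [006] 2569.853209: nvme_setup_discard <-nvme_setup_cmd
kworker/6:1H-298 [006] 2569.958202: nvme_setup_discard <-nvme_setup_cmd
"""

# Pull out the timestamp of every nvme_setup_discard event.
stamps = [float(m.group(1))
          for m in re.finditer(r"(\d+\.\d+): nvme_setup_discard", trace)]
gaps = [b - a for a, b in zip(stamps, stamps[1:])]
avg_gap_ms = 1000 * sum(gaps) / len(gaps)
print(f"{len(stamps)} discards, average gap {avg_gap_ms:.0f} ms")
```

The roughly 104 ms average gap between discards in these samples lines up with the iops_limit=10 default (one discard per ~100 ms) discussed later in the thread.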
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-03-02 8:04 ` Linux regression tracking #adding (Thorsten Leemhuis) @ 2023-04-04 10:52 ` Linux regression tracking #update (Thorsten Leemhuis) 2023-04-21 13:56 ` Linux regression tracking #update (Thorsten Leemhuis) 1 sibling, 0 replies; 27+ messages in thread From: Linux regression tracking #update (Thorsten Leemhuis) @ 2023-04-04 10:52 UTC (permalink / raw) To: linux-btrfs; +Cc: Linux kernel regressions list On 02.03.23 09:04, Linux regression tracking #adding (Thorsten Leemhuis) wrote: > On 01.03.23 20:30, Sergei Trofimovich wrote: >> Hi btrfs maintainers! >> >> Tl;DR: >> >> After 63a7cb13071842 "btrfs: auto enable discard=async when possible" I >> see constant DISCARD storm towards my NVME device be it idle or not. >> >> No storm: v6.1 and older >> Has storm: v6.2 and newer > > Thanks for the report. To be sure the issue doesn't fall through the > cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression > tracking bot: > > #regzbot ^introduced 63a7cb13071842 > #regzbot title btrfs: DISCARD storm towards NVME device be it idle or not > #regzbot ignore-activity #regzbot monitor: https://lore.kernel.org/all/CAHmG9huwQcQXvy3HS0OP9bKFxwUa3aQj9MXZCr74emn0U+efqQ@mail.gmail.com/ #regzbot ignore-activity Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) -- Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr That page also explains what to do if mails like this annoy you. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-03-02 8:04 ` Linux regression tracking #adding (Thorsten Leemhuis) 2023-04-04 10:52 ` Linux regression tracking #update (Thorsten Leemhuis) @ 2023-04-21 13:56 ` Linux regression tracking #update (Thorsten Leemhuis) 1 sibling, 0 replies; 27+ messages in thread From: Linux regression tracking #update (Thorsten Leemhuis) @ 2023-04-21 13:56 UTC (permalink / raw) To: linux-btrfs; +Cc: Linux kernel regressions list [TLDR: This mail is primarily relevant for Linux kernel regression tracking. See link in footer if these mails annoy you.] On 02.03.23 09:04, Linux regression tracking #adding (Thorsten Leemhuis) wrote: > On 01.03.23 20:30, Sergei Trofimovich wrote: >> >> Tl;DR: >> >> After 63a7cb13071842 "btrfs: auto enable discard=async when possible" I >> see constant DISCARD storm towards my NVME device be it idle or not. >> >> No storm: v6.1 and older >> Has storm: v6.2 and newer > > Thanks for the report. To be sure the issue doesn't fall through the > cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression > tracking bot: > > #regzbot ^introduced 63a7cb13071842 > #regzbot title btrfs: DISCARD storm towards NVME device be it idle or not > #regzbot ignore-activity FWIW, the fix is in next now, regzbot just missed it due to a URL-encoding bug: #regzbot monitor: https://lore.kernel.org/linux-btrfs/cover.1680723651.git.boris@bur.io/T/#m21c1856ccc8fd24ccb6fbad7d9b54f0bae115d25 #regzbot fix: 2e55571fddf #regzbot ignore-activity Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) -- Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr That page also explains what to do if mails like this annoy you. ^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <94cf49d0-fa2d-cc2c-240e-222706d69eb3@oracle.com>]
[parent not found: <20230302105406.2cd367f7@nz>]
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async [not found] ` <20230302105406.2cd367f7@nz> @ 2023-03-15 11:44 ` Linux regression tracking (Thorsten Leemhuis) 2023-03-15 16:34 ` Sergei Trofimovich 0 siblings, 1 reply; 27+ messages in thread From: Linux regression tracking (Thorsten Leemhuis) @ 2023-03-15 11:44 UTC (permalink / raw) To: Sergei Trofimovich, Anand Jain Cc: linux-btrfs, David Sterba, Boris Burkov, Chris Mason, Josef Bacik, Linux kernel regressions list Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting for once, to make this easily accessible to everyone. I added this to the tracking, but it seems nothing happened for nearly two weeks. Sergei, did you find a workaround that works for you? Or is this something that other people might run into as well and thus better should be fixed? Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) -- Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr If I did something stupid, please tell me, as explained on that page. #regzbot poke On 02.03.23 11:54, Sergei Trofimovich wrote: > On Thu, 2 Mar 2023 17:12:27 +0800 > Anand Jain <anand.jain@oracle.com> wrote: > >> On 3/2/23 03:30, Sergei Trofimovich wrote: >>> Hi btrfs maintainers! >>> >>> Tl;DR: >>> >>> After 63a7cb13071842 "btrfs: auto enable discard=async when possible" I >>> see constant DISCARD storm towards my NVME device be it idle or not. >>> >>> No storm: v6.1 and older >>> Has storm: v6.2 and newer >>> >>> More words: >>> >>> After upgrade from 6.1 to 6.2 I noticed that Disk led on my desktop >>> started flashing incessantly regardless of present or absent workload. 
>>> >>> I think I confirmed the storm with `perf`: led flashes align with output >>> of: >>> >>> # perf ftrace -a -T 'nvme_setup*' | cat >>> >>> kworker/6:1H-298 [006] 2569.645201: nvme_setup_cmd <-nvme_queue_rq >>> kworker/6:1H-298 [006] 2569.645205: nvme_setup_discard <-nvme_setup_cmd >>> kworker/6:1H-298 [006] 2569.749198: nvme_setup_cmd <-nvme_queue_rq >>> kworker/6:1H-298 [006] 2569.749202: nvme_setup_discard <-nvme_setup_cmd >>> kworker/6:1H-298 [006] 2569.853204: nvme_setup_cmd <-nvme_queue_rq >>> kworker/6:1H-298 [006] 2569.853209: nvme_setup_discard <-nvme_setup_cmd >>> kworker/6:1H-298 [006] 2569.958198: nvme_setup_cmd <-nvme_queue_rq >>> kworker/6:1H-298 [006] 2569.958202: nvme_setup_discard <-nvme_setup_cmd >>> >>> `iotop` shows no read/write IO at all (expected). >>> >>> I was able to bisect it down to this commit: >>> >>> $ git bisect good >>> 63a7cb13071842966c1ce931edacbc23573aada5 is the first bad commit >>> commit 63a7cb13071842966c1ce931edacbc23573aada5 >>> Author: David Sterba <dsterba@suse.com> >>> Date: Tue Jul 26 20:54:10 2022 +0200 >>> >>> btrfs: auto enable discard=async when possible >>> >>> There's a request to automatically enable async discard for capable >>> devices. We can do that, the async mode is designed to wait for larger >>> freed extents and is not intrusive, with limits to iops, kbps or latency. >>> >>> The status and tunables will be exported in /sys/fs/btrfs/FSID/discard . >>> >>> The automatic selection is done if there's at least one discard capable >>> device in the filesystem (not capable devices are skipped). Mounting >>> with any other discard option will honor that option, notably mounting >>> with nodiscard will keep it disabled. 
>>> >>> Link: https://lore.kernel.org/linux-btrfs/CAEg-Je_b1YtdsCR0zS5XZ_SbvJgN70ezwvRwLiCZgDGLbeMB=w@mail.gmail.com/ >>> Reviewed-by: Boris Burkov <boris@bur.io> >>> Signed-off-by: David Sterba <dsterba@suse.com> >>> >>> fs/btrfs/ctree.h | 1 + >>> fs/btrfs/disk-io.c | 14 ++++++++++++++ >>> fs/btrfs/super.c | 2 ++ >>> fs/btrfs/volumes.c | 3 +++ >>> fs/btrfs/volumes.h | 2 ++ >>> 5 files changed, 22 insertions(+) >>> >>> Is this storm a known issue? I did not dig too much into the patch. But >>> glancing at it this bit looks slightly off: >>> >>> + if (bdev_max_discard_sectors(bdev)) >>> + fs_devices->discardable = true; >>> >>> Is it expected that there is no `= false` assignment? >>> >>> This is the list of `btrfs` filesystems I have: >>> >>> $ cat /proc/mounts | fgrep btrfs >>> /dev/nvme0n1p3 / btrfs rw,noatime,compress=zstd:3,ssd,space_cache,subvolid=848,subvol=/nixos 0 0 >>> /dev/sda3 /mnt/archive btrfs rw,noatime,compress=zstd:3,space_cache,subvolid=5,subvol=/ 0 0 >>> # skipped bind mounts >>> >> >> >> >>> The device is: >>> >>> $ lspci | fgrep -i Solid >>> 01:00.0 Non-Volatile memory controller: ADATA Technology Co., Ltd. XPG SX8200 Pro PCIe Gen3x4 M.2 2280 Solid State Drive (rev 03) >> >> >> It is a SSD device with NVME interface, that needs regular discard. >> Why not try tune io intensity using >> >> /sys/fs/btrfs/<uuid>/discard >> >> options? >> >> Maybe not all discardable sectors are not issued at once. It is a good >> idea to try with a fresh mkfs (which runs discard at mkfs) to see if >> discard is being issued even if there are no fs activities. > > Ah, thank you Anand! I poked a bit more in `perf ftrace` and I think I > see a "slow" pass through the discard backlog: > > /sys/fs/btrfs/<UUID>/discard$ cat iops_limit > 10 > > Twice a minute I get a short burst of file creates/deletes that produces > a bit of free space in many block groups. That enqueues hundreds of work > items. 
> > $ sudo perf ftrace -a -T 'btrfs_discard_workfn' -T 'btrfs_issue_discard' -T 'btrfs_discard_queue_work' > btrfs-transacti-407 [011] 42800.424027: btrfs_discard_queue_work <-__btrfs_add_free_space > btrfs-transacti-407 [011] 42800.424070: btrfs_discard_queue_work <-__btrfs_add_free_space > ... > btrfs-transacti-407 [011] 42800.425053: btrfs_discard_queue_work <-__btrfs_add_free_space > btrfs-transacti-407 [011] 42800.425055: btrfs_discard_queue_work <-__btrfs_add_free_space > > 193 entries of btrfs_discard_queue_work. > It took 1ms to enqueue all of the work into the workqueue. > > kworker/u64:1-2379115 [000] 42800.487010: btrfs_discard_workfn <-process_one_work > kworker/u64:1-2379115 [000] 42800.487028: btrfs_issue_discard <-btrfs_discard_extent > kworker/u64:1-2379115 [005] 42800.594010: btrfs_discard_workfn <-process_one_work > kworker/u64:1-2379115 [005] 42800.594031: btrfs_issue_discard <-btrfs_discard_extent > ... > kworker/u64:15-2396822 [007] 42830.441487: btrfs_discard_workfn <-process_one_work > kworker/u64:15-2396822 [007] 42830.441502: btrfs_issue_discard <-btrfs_discard_extent > kworker/u64:15-2396822 [000] 42830.546497: btrfs_discard_workfn <-process_one_work > kworker/u64:15-2396822 [000] 42830.546524: btrfs_issue_discard <-btrfs_discard_extent > > 286 pairs of btrfs_discard_workfn / btrfs_issue_discard. > Each pair takes about 100ms to process, which seems to match iops_limit=10. > That means I can get about 10 discards per second (roughly 300 per > 30-second commit interval) at most. > > btrfs-transacti-407 [002] 42830.634216: btrfs_discard_queue_work <-__btrfs_add_free_space > btrfs-transacti-407 [002] 42830.634228: btrfs_discard_queue_work <-__btrfs_add_free_space > ... > > The next transaction started 30 seconds later, which is the default commit > interval. > > My file system is 512GB in size. My guess is that I get about one discard entry > per block group on each transaction. > > Does my system keep up with the scheduled discard backlog? Can I peek at > the workqueue size? 
> > Is iops_limit=10 a reasonable default for discard=async? It feels like > for larger file systems it will not be enough even for this idle state. > ^ permalink raw reply [flat|nested] 27+ messages in thread
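The numbers in the message above allow a rough keep-up calculation, assuming ~193 queued work items per transaction, a 30-second commit interval, and one discard per ~100 ms at iops_limit=10 (all figures taken from the trace; this is illustrative arithmetic, not a measurement):

```python
queued_per_txn = 193      # btrfs_discard_queue_work calls seen per commit
commit_interval_s = 30.0  # default btrfs commit interval
discards_per_s = 10       # iops_limit=10 -> one discard every ~100ms

capacity_per_txn = discards_per_s * commit_interval_s  # max discards per interval
busy_s = queued_per_txn / discards_per_s               # time spent discarding

print(f"capacity per interval: {capacity_per_txn:.0f} discards")
print(f"backlog per interval:  {queued_per_txn} discards")
print(f"disk busy discarding:  {busy_s:.1f}s of every {commit_interval_s:.0f}s")
```

Under these assumptions the queue does drain (193 < 300 per interval), but the disk would spend roughly 19 of every 30 seconds issuing discards, which on an otherwise idle system would look exactly like the constant-LED "storm" reported at the top of the thread.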
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-03-15 11:44 ` Linux regression tracking (Thorsten Leemhuis) @ 2023-03-15 16:34 ` Sergei Trofimovich 0 siblings, 0 replies; 27+ messages in thread From: Sergei Trofimovich @ 2023-03-15 16:34 UTC (permalink / raw) To: Linux regression tracking (Thorsten Leemhuis) Cc: Linux regressions mailing list, Anand Jain, linux-btrfs, David Sterba, Boris Burkov, Chris Mason, Josef Bacik [-- Attachment #1: Type: text/plain, Size: 8348 bytes --] On Wed, 15 Mar 2023 12:44:34 +0100 "Linux regression tracking (Thorsten Leemhuis)" <regressions@leemhuis.info> wrote: > Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting > for once, to make this easily accessible to everyone. > > I added this to the tracking, but it seems nothing happened for nearly > two weeks. > > Sergei, did you find a workaround that works for you? Or is this > something that other people might run into as well and thus better > should be fixed? I used the workaround of cranking up the discard IOPS limit from 10 to 1000: # echo 1000 > /sys/fs/btrfs/<UUID>/discard/iops_limit But I am not sure if it's a safe or a reasonable fix. I would prefer someone from btrfs to comment on whether it's expected kernel behaviour. > Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) > -- > Everything you wanna know about Linux kernel regression tracking: > https://linux-regtracking.leemhuis.info/about/#tldr > If I did something stupid, please tell me, as explained on that page. > > #regzbot poke > > On 02.03.23 11:54, Sergei Trofimovich wrote: > > On Thu, 2 Mar 2023 17:12:27 +0800 > > Anand Jain <anand.jain@oracle.com> wrote: > >> On 3/2/23 03:30, Sergei Trofimovich wrote: > >>> Hi btrfs maintainers! > >>> > >>> Tl;DR: > >>> > >>> After 63a7cb13071842 "btrfs: auto enable discard=async when possible" I > >>> see constant DISCARD storm towards my NVME device be it idle or not. 
> >>> > >>> No storm: v6.1 and older > >>> Has storm: v6.2 and newer > >>> > >>> More words: > >>> > >>> After upgrade from 6.1 to 6.2 I noticed that Disk led on my desktop > >>> started flashing incessantly regardless of present or absent workload. > >>> > >>> I think I confirmed the storm with `perf`: led flashes align with output > >>> of: > >>> > >>> # perf ftrace -a -T 'nvme_setup*' | cat > >>> > >>> kworker/6:1H-298 [006] 2569.645201: nvme_setup_cmd <-nvme_queue_rq > >>> kworker/6:1H-298 [006] 2569.645205: nvme_setup_discard <-nvme_setup_cmd > >>> kworker/6:1H-298 [006] 2569.749198: nvme_setup_cmd <-nvme_queue_rq > >>> kworker/6:1H-298 [006] 2569.749202: nvme_setup_discard <-nvme_setup_cmd > >>> kworker/6:1H-298 [006] 2569.853204: nvme_setup_cmd <-nvme_queue_rq > >>> kworker/6:1H-298 [006] 2569.853209: nvme_setup_discard <-nvme_setup_cmd > >>> kworker/6:1H-298 [006] 2569.958198: nvme_setup_cmd <-nvme_queue_rq > >>> kworker/6:1H-298 [006] 2569.958202: nvme_setup_discard <-nvme_setup_cmd > >>> > >>> `iotop` shows no read/write IO at all (expected). > >>> > >>> I was able to bisect it down to this commit: > >>> > >>> $ git bisect good > >>> 63a7cb13071842966c1ce931edacbc23573aada5 is the first bad commit > >>> commit 63a7cb13071842966c1ce931edacbc23573aada5 > >>> Author: David Sterba <dsterba@suse.com> > >>> Date: Tue Jul 26 20:54:10 2022 +0200 > >>> > >>> btrfs: auto enable discard=async when possible > >>> > >>> There's a request to automatically enable async discard for capable > >>> devices. We can do that, the async mode is designed to wait for larger > >>> freed extents and is not intrusive, with limits to iops, kbps or latency. > >>> > >>> The status and tunables will be exported in /sys/fs/btrfs/FSID/discard . > >>> > >>> The automatic selection is done if there's at least one discard capable > >>> device in the filesystem (not capable devices are skipped). 
Mounting > >>> with any other discard option will honor that option, notably mounting > >>> with nodiscard will keep it disabled. > >>> > >>> Link: https://lore.kernel.org/linux-btrfs/CAEg-Je_b1YtdsCR0zS5XZ_SbvJgN70ezwvRwLiCZgDGLbeMB=w@mail.gmail.com/ > >>> Reviewed-by: Boris Burkov <boris@bur.io> > >>> Signed-off-by: David Sterba <dsterba@suse.com> > >>> > >>> fs/btrfs/ctree.h | 1 + > >>> fs/btrfs/disk-io.c | 14 ++++++++++++++ > >>> fs/btrfs/super.c | 2 ++ > >>> fs/btrfs/volumes.c | 3 +++ > >>> fs/btrfs/volumes.h | 2 ++ > >>> 5 files changed, 22 insertions(+) > >>> > >>> Is this storm a known issue? I did not dig too much into the patch. But > >>> glancing at it this bit looks slightly off: > >>> > >>> + if (bdev_max_discard_sectors(bdev)) > >>> + fs_devices->discardable = true; > >>> > >>> Is it expected that there is no `= false` assignment? > >>> > >>> This is the list of `btrfs` filesystems I have: > >>> > >>> $ cat /proc/mounts | fgrep btrfs > >>> /dev/nvme0n1p3 / btrfs rw,noatime,compress=zstd:3,ssd,space_cache,subvolid=848,subvol=/nixos 0 0 > >>> /dev/sda3 /mnt/archive btrfs rw,noatime,compress=zstd:3,space_cache,subvolid=5,subvol=/ 0 0 > >>> # skipped bind mounts > >>> > >> > >> > >> > >>> The device is: > >>> > >>> $ lspci | fgrep -i Solid > >>> 01:00.0 Non-Volatile memory controller: ADATA Technology Co., Ltd. XPG SX8200 Pro PCIe Gen3x4 M.2 2280 Solid State Drive (rev 03) > >> > >> > >> It is a SSD device with NVME interface, that needs regular discard. > >> Why not try tune io intensity using > >> > >> /sys/fs/btrfs/<uuid>/discard > >> > >> options? > >> > >> Maybe not all discardable sectors are not issued at once. It is a good > >> idea to try with a fresh mkfs (which runs discard at mkfs) to see if > >> discard is being issued even if there are no fs activities. > > > > Ah, thank you Anand! 
I poked a bit more in `perf ftrace` and I think I > > see a "slow" pass through the discard backlog: > > > > /sys/fs/btrfs/<UUID>/discard$ cat iops_limit > > 10 > > > > Twice a minute I get a short burst of file creates/deletes that produces > > a bit of free space in many block groups. That enqueues hundreds of work > > items. > > > > $ sudo perf ftrace -a -T 'btrfs_discard_workfn' -T 'btrfs_issue_discard' -T 'btrfs_discard_queue_work' > > btrfs-transacti-407 [011] 42800.424027: btrfs_discard_queue_work <-__btrfs_add_free_space > > btrfs-transacti-407 [011] 42800.424070: btrfs_discard_queue_work <-__btrfs_add_free_space > > ... > > btrfs-transacti-407 [011] 42800.425053: btrfs_discard_queue_work <-__btrfs_add_free_space > > btrfs-transacti-407 [011] 42800.425055: btrfs_discard_queue_work <-__btrfs_add_free_space > > > > 193 entries of btrfs_discard_queue_work. > > It took 1ms to enqueue all of the work into the workqueue. > > > > kworker/u64:1-2379115 [000] 42800.487010: btrfs_discard_workfn <-process_one_work > > kworker/u64:1-2379115 [000] 42800.487028: btrfs_issue_discard <-btrfs_discard_extent > > kworker/u64:1-2379115 [005] 42800.594010: btrfs_discard_workfn <-process_one_work > > kworker/u64:1-2379115 [005] 42800.594031: btrfs_issue_discard <-btrfs_discard_extent > > ... > > kworker/u64:15-2396822 [007] 42830.441487: btrfs_discard_workfn <-process_one_work > > kworker/u64:15-2396822 [007] 42830.441502: btrfs_issue_discard <-btrfs_discard_extent > > kworker/u64:15-2396822 [000] 42830.546497: btrfs_discard_workfn <-process_one_work > > kworker/u64:15-2396822 [000] 42830.546524: btrfs_issue_discard <-btrfs_discard_extent > > > > 286 pairs of btrfs_discard_workfn / btrfs_issue_discard. > > Each pair takes 10ms to process, which seems to match iops_limit=10. > > That means I can get about 300 discards per second max. 
> > > > btrfs-transacti-407 [002] 42830.634216: btrfs_discard_queue_work <-__btrfs_add_free_space > > > > btrfs-transacti-407 [002] 42830.634228: btrfs_discard_queue_work <-__btrfs_add_free_space > > > > ... > > > > > > The next transaction started 30 seconds later, which is the default commit > > interval. > > > > My file system is 512GB in size. My guess is that I get about one discard entry > > per block group on each transaction. > > > > Does my system keep up with the scheduled discard backlog? Can I peek at > > the workqueue size? > > > > Is iops_limit=10 a reasonable default for discard=async? It feels like > > for larger file systems it will not be enough even for this idle state. > > -- Sergei [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 981 bytes --] ^ permalink raw reply [flat|nested] 27+ messages in thread
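Under the same clamped-delay model discussed earlier in the thread (delay = 1000 ms / iops_limit, clamped to [1 ms, 1000 ms]), the workaround of raising iops_limit mainly changes how long the ~193-item per-transaction backlog takes to drain. A hypothetical comparison under those assumptions, not a benchmark:

```python
def drain_seconds(items: int, iops_limit: int) -> float:
    """Time to issue `items` discards at the clamped metered rate."""
    delay_ms = min(max(1000.0 / iops_limit, 1.0), 1000.0)
    return items * delay_ms / 1000.0

# Default (10) vs. the workaround from this message (1000) vs. the
# ~6000 mentioned elsewhere in the thread (clamps to the same as 1000).
for limit in (10, 1000, 6000):
    print(f"iops_limit={limit:>4}: ~{drain_seconds(193, limit):.2f}s per backlog")
```

At the default the disk is busy discarding for most of each 30-second commit interval; at 1000 the same backlog clears in well under a second, matching the observation that the constant disk activity disappears after tuning.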
end of thread, other threads:[~2023-04-21 13:56 UTC | newest] Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2023-03-20 22:40 [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async Christopher Price 2023-03-21 21:26 ` Josef Bacik 2023-03-22 8:38 ` Christoph Hellwig 2023-03-23 22:26 ` Sergei Trofimovich 2023-04-04 10:49 ` Linux regression tracking (Thorsten Leemhuis) 2023-04-04 16:04 ` Christoph Hellwig 2023-04-04 16:20 ` Roman Mamedov 2023-04-04 16:27 ` Christoph Hellwig 2023-04-04 23:37 ` Damien Le Moal 2023-04-04 18:15 ` Chris Mason 2023-04-04 18:51 ` Boris Burkov 2023-04-04 19:22 ` David Sterba 2023-04-04 19:39 ` Boris Burkov 2023-04-05 8:17 ` Linux regression tracking (Thorsten Leemhuis) 2023-04-10 2:03 ` Michael Bromilow 2023-04-11 17:52 ` David Sterba 2023-04-11 18:15 ` Linux regression tracking (Thorsten Leemhuis) 2023-04-04 19:08 ` Sergei Trofimovich 2023-04-05 6:18 ` Christoph Hellwig 2023-04-05 12:01 ` Chris Mason 2023-04-04 18:23 ` Boris Burkov 2023-04-04 19:12 ` Sergei Trofimovich [not found] <Y/+n1wS/4XAH7X1p@nz> 2023-03-02 8:04 ` Linux regression tracking #adding (Thorsten Leemhuis) 2023-04-04 10:52 ` Linux regression tracking #update (Thorsten Leemhuis) 2023-04-21 13:56 ` Linux regression tracking #update (Thorsten Leemhuis) [not found] ` <94cf49d0-fa2d-cc2c-240e-222706d69eb3@oracle.com> [not found] ` <20230302105406.2cd367f7@nz> 2023-03-15 11:44 ` Linux regression tracking (Thorsten Leemhuis) 2023-03-15 16:34 ` Sergei Trofimovich