linux-block.vger.kernel.org archive mirror
From: "Martin K. Petersen" <martin.petersen@oracle.com>
To: Roman Mamedov <rm@romanrm.net>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>,
	Jeff Mahoney <jeffm@suse.com>,
	Keith Busch <keith.busch@intel.com>,
	Ric Wheeler <ricwheeler@gmail.com>,
	Dave Chinner <david@fromorbit.com>,
	lsf-pc@lists.linux-foundation.org,
	linux-xfs <linux-xfs@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-ext4 <linux-ext4@vger.kernel.org>,
	linux-btrfs <linux-btrfs@vger.kernel.org>,
	linux-block@vger.kernel.org
Subject: Re: [LSF/MM TOPIC] More async operations for file systems - async discard?
Date: Fri, 22 Feb 2019 09:12:44 -0500	[thread overview]
Message-ID: <yq136ofj4ir.fsf@oracle.com> (raw)
In-Reply-To: <20190222111532.4ead81dc@natsu> (Roman Mamedov's message of "Fri, 22 Feb 2019 11:15:32 +0500")


Roman,

>> Consequently, many of the modern devices that claim to support
>> discard to make us software folks happy (or to satisfy a purchase
>> order requirement) complete the commands without doing anything at
>> all.  We're simply wasting queue slots.
>
> Any example of such devices? Let alone "many"? Where you would issue a
> full-device blkdiscard, but then just read back old data.

I obviously can't mention names or go into implementation details. But
there are many drives out there that return old data. And that's
perfectly within spec.

At least some of the pain in the industry in this department can be
attributed to us Linux folks and RAID device vendors. We all wanted
deterministic zeroes on completion of DSM TRIM, UNMAP, or DEALLOCATE.
The device vendors weren't happy about that and we ended up with weasel
language in the specs. This led to the current libata whitelist mess
for SATA SSDs and ongoing vendor implementation confusion in SCSI and
NVMe devices.

On the Linux side the problem was that we originally used discard for
two distinct purposes: Clearing block ranges and deallocating block
ranges. We cleaned that up a while back and now have BLKZEROOUT and
BLKDISCARD. Those operations get translated to different operations
depending on the device. We also cleaned up several of the
inconsistencies in the SCSI and NVMe specs to facilitate making this
distinction possible in the kernel.
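[Editorial aside, not part of the original message: the two ioctls named
above are real and live in linux/fs.h; the sketch below shows roughly how
userspace drives them. The ioctl numbers assume the x86-64 _IO() encoding,
and the device path and range are placeholders - both calls need a real
block device and CAP_SYS_ADMIN.]

```python
# Sketch: BLKDISCARD (deallocate) vs BLKZEROOUT (clear) from userspace.
# From linux/fs.h: BLKDISCARD = _IO(0x12, 119), BLKZEROOUT = _IO(0x12, 127),
# where _IO(type, nr) == (type << 8) | nr on x86-64.
import fcntl
import os
import struct

BLKDISCARD = (0x12 << 8) | 119  # deallocate: later reads may return anything
BLKZEROOUT = (0x12 << 8) | 127  # clear: later reads are guaranteed zeroes

def discard_range(dev_path, start, length, zero=False):
    """Deallocate (or explicitly zero) a byte range on a block device.
    dev_path is a placeholder; this needs root and a real device."""
    req = BLKZEROOUT if zero else BLKDISCARD
    fd = os.open(dev_path, os.O_WRONLY)
    try:
        # Both ioctls take a uint64 range[2] = {start, length}, in bytes.
        fcntl.ioctl(fd, req, struct.pack("QQ", start, length))
    finally:
        os.close(fd)
```

This is essentially what blkdiscard(8) and blkdiscard -z do; the kernel
then translates the request into DSM TRIM, UNMAP/WRITE SAME, or
DEALLOCATE/WRITE ZEROES depending on the transport and device.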

In the meantime the SSD vendors made great strides in refining their
flash management, to the point where pretty much all enterprise device
vendors will ask you not to issue discards. The benefits simply do not
outweigh the costs.

If you have special workloads where write amplification is a major
concern, it may still be advantageous to issue discards to reduce it
and prolong drive life. However, these workloads are increasingly moving
away from the classic LBA read/write model. Open Channel originally
targeted this space. Right now work is underway on Zoned Namespaces and
Key-Value command sets in NVMe.

These curated application workload protocols are fundamental departures
from the traditional way of accessing storage. And my postulate is that
where tail latency and drive lifetime management are important, those new
command sets offer much better bang for the buck. And they make the
notion of discard completely moot. That's why I don't think it's going
to be terribly important in the long term.

This leaves consumer devices and enterprise devices using the
traditional LBA I/O model.

For consumer devices I still think fstrim is a good compromise. Lack of
queuing for DSM hurt us for a long time. And when it was finally added
to the ATA command set, many device vendors got their implementations
wrong. So it sucked for a lot longer than it should have. And of course
FTL implementations differ.
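[Editorial aside, not part of the original message: the fstrim compromise
mentioned above boils down to one FITRIM ioctl per mounted filesystem,
issued periodically instead of inline discards. A minimal sketch, assuming
the x86-64 ioctl encoding; the mountpoint is a placeholder and the call
needs root.]

```python
# Sketch: what fstrim(8) does under the hood - the FITRIM ioctl, which
# asks the filesystem to discard its free space in one batched pass.
import fcntl
import os
import struct

# FITRIM = _IOWR('X', 121, struct fstrim_range) from linux/fs.h.
# struct fstrim_range is three __u64 fields: start, len, minlen (24 bytes).
FITRIM = (3 << 30) | (24 << 16) | (ord("X") << 8) | 121  # x86-64 encoding

def fstrim(mountpoint, start=0, length=2**64 - 1, minlen=0):
    """Trim free space on the filesystem mounted at mountpoint.
    Returns the number of bytes trimmed, as reported back by the fs."""
    fd = os.open(mountpoint, os.O_RDONLY)
    try:
        buf = bytearray(struct.pack("QQQ", start, length, minlen))
        fcntl.ioctl(fd, FITRIM, buf)  # kernel rewrites 'len' with bytes trimmed
        return struct.unpack("QQQ", buf)[1]
    finally:
        os.close(fd)
```

Running this weekly (as the fstrim.timer unit shipped by most distros
does) keeps the queued-TRIM and FTL quirks off the hot I/O path.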

For enterprise devices we're still in the situation where vendors
generally prefer for us not to use discard. I would love for the
DEALLOCATE/WRITE ZEROES mess to be sorted out in their FTLs, but I have
fairly low confidence that it's going to happen. Case in point: Despite
a lot of leverage and purchasing power, the cloud industry has not been
terribly successful in compelling the drive manufacturers to make
DEALLOCATE perform well for typical application workloads. So I'm not
holding my breath...

-- 
Martin K. Petersen	Oracle Linux Engineering


Thread overview: 15+ messages
2019-02-17 20:36 [LSF/MM TOPIC] More async operations for file systems - async discard? Ric Wheeler
2019-02-17 21:09 ` Dave Chinner
2019-02-17 23:42   ` Ric Wheeler
2019-02-18  2:22     ` Dave Chinner
2019-02-18 22:30       ` Ric Wheeler
2019-02-20 23:47     ` Keith Busch
2019-02-21 20:08       ` Dave Chinner
2019-02-21 23:55       ` Jeff Mahoney
2019-02-22  3:01         ` Martin K. Petersen
2019-02-22  6:15           ` Roman Mamedov
2019-02-22 14:12             ` Martin K. Petersen [this message]
2019-02-22  2:51       ` Martin K. Petersen
2019-02-22 16:45         ` Keith Busch
2019-02-27 11:40           ` Ric Wheeler
2019-02-27 13:24           ` Matthew Wilcox
