From: "Theodore Ts'o" <tytso@mit.edu>
To: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-block@vger.kernel.org
Subject: Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
Date: Thu, 2 Mar 2023 23:20:26 -0500
Message-ID: <ZAF1iuaXTJEvOe5c@mit.edu>
In-Reply-To: <yq1356mh925.fsf@ca-mkp.ca.oracle.com>

On Thu, Mar 02, 2023 at 09:54:59PM -0500, Martin K. Petersen wrote:
> 
> Hi Ted!
> 
> > With NVMe, it is possible for a storage device to promise this without
> > requiring read-modify-write updates for sub-16k writes.  All that is
> > necessary are some changes in the block layer so that the kernel does
> > not inadvertently tear a write request when splitting a bio because it
> > is too large (perhaps because it got merged with some other request,
> > and then it gets split at an inconvenient boundary).
> 
> We have been working on support for atomic writes and it is not as
> simple as it sounds. Atomic operations in SCSI and NVMe have semantic
> differences which are challenging to reconcile. On top of that, both the
> SCSI and NVMe specs are buggy in the atomics department. We are working
> to get things fixed in both standards and aim to discuss our
> implementation at LSF/MM.

I'd be very interested to learn more about what you've found.  I know
more than one cloud provider is thinking about how to use the NVMe
spec to send information about how their emulated block devices work.
This has come up at our weekly ext4 video conference, and given that I
gave a talk about it in 2018[1], there's quite a lot of similarity in
what folks are thinking about.  Basically, MySQL and Postgres use 16k
database pages, and if they can drop the special doublewrite
techniques they use to avoid torn writes, because they can depend on
their Cloud Block Devices Working A Certain Way, it can make for very
noticeable performance improvements.

[1] https://www.youtube.com/watch?v=gIeuiGg-_iw
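
To make the torn-write problem concrete, here is a minimal sketch of
what happens when a 16k page is persisted in smaller pieces and the
system goes down partway through.  The 4k piece size, the CRC32
trailer, and all of the function names are assumptions made up for
this illustration; they are not taken from MySQL, Postgres, or the
block layer.

```python
import zlib

PAGE_SIZE = 16 * 1024    # database page size discussed above
PIECE_SIZE = 4 * 1024    # assumed unit the device persists atomically (hypothetical)


def make_page(fill: int) -> bytes:
    """Build a page whose last 4 bytes are a CRC32 of the preceding payload."""
    payload = bytes([fill]) * (PAGE_SIZE - 4)
    return payload + zlib.crc32(payload).to_bytes(4, "little")


def page_is_intact(page: bytes) -> bool:
    """Return True if the stored CRC32 trailer matches the payload."""
    payload, stored = page[:-4], page[-4:]
    return zlib.crc32(payload).to_bytes(4, "little") == stored


def write_with_crash(old: bytes, new: bytes, pieces_written: int) -> bytes:
    """Return the on-disk contents if power fails after `pieces_written` pieces."""
    cut = pieces_written * PIECE_SIZE
    return new[:cut] + old[cut:]


old_page = make_page(0xAA)
new_page = make_page(0xBB)

# Crash after 2 of the 4 pieces: half new data, half old data.
torn = write_with_crash(old_page, new_page, 2)

# An atomic 16k write can only ever leave one of these two states behind.
all_old = write_with_crash(old_page, new_page, 0)
all_new = write_with_crash(old_page, new_page, 4)
```

Here `torn` is neither the old page nor the new one, and its checksum
no longer matches, so the database cannot use it without a second
copy to recover from.  That second copy is what doublewrite pays for.
If the device promises that the 16k write lands as all-old or all-new,
the second copy is unnecessary.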

So while the standards might allow standards-compliant physical
devices to do some really weird sh*t, it might be that if all cloud
vendors do things in the same way, I could see various cloud workloads
starting to depend on extra-standard behaviour, much like a lot of
sysadmins assume that low-numbered LBAs are on the outer diameter of
the HDD and are much more performant than sectors on the inner
diameter of the HDD.  This is completely not guaranteed by the
standard specs, but it's become a de facto standard.

That's not a great place to be, and it would be great if we can find ways
that are much more reliable in terms of querying a standards-compliant
storage device and knowing whether we can depend on a certain behavior
--- but I also know how slowly storage standards bodies move.  :-(

> Hinting didn't see widespread adoption because we in Linux, as well as
> the various interested databases, preferred hints to be per-I/O
> properties. Whereas $OTHER_OS insisted that hints should be statically
> assigned to LBA ranges on media. This left vendors having to choose
> between two very different approaches and consequently they chose not to
> support any of them.

I wasn't aware of that history.  Thanks for filling that bit in.

Fortunately, in 2023, it appears that for many cloud vendors, the
teams involved care a lot more about Linux than $OTHER_OS.  So
hopefully we'll have a lot more success in getting write hints
generally available to hyperscale cloud customers.

From an industry-wide perspective, it would be useful if the write
hints used by Hyperscale Cloud Vendor #1 are very similar to the
write hints supported by Hyperscale Cloud Vendor #2.  Standards
committees aren't the only way that companies can collaborate in an
anti-trust compliant way.  Open source is another way; and especially
if we can show that a set of hints works well for the Linux kernel and
Linux applications --- then what we ship in the Linux kernel can help
shape the set of "write hints" that cloud storage devices will
support.

					- Ted

P.S.  From an LSF/MM program perspective, I suspect we may want to have
more than one session; one that is focused on standards and atomic
writes, and another that is focused on write hints.  The first might
be mostly block and fs focused, and the second would probably be of
interest to mm folks as well.
