From: "Theodore Ts'o" <tytso@mit.edu>
To: Matthew Wilcox <willy@infradead.org>
Cc: "Hannes Reinecke" <hare@suse.de>,
	"Luis Chamberlain" <mcgrof@kernel.org>,
	"Keith Busch" <kbusch@kernel.org>,
	"Pankaj Raghav" <p.raghav@samsung.com>,
	"Daniel Gomez" <da.gomez@samsung.com>,
	"Javier González" <javier.gonz@samsung.com>,
	lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-block@vger.kernel.org
Subject: Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
Date: Mon, 6 Mar 2023 11:12:14 -0500
Message-ID: <20230306161214.GB959362@mit.edu>
In-Reply-To: <ZAWi5KwrsYL+0Uru@casper.infradead.org>

On Mon, Mar 06, 2023 at 08:23:00AM +0000, Matthew Wilcox wrote:
> 
> All current filesystems that I'm aware of require their fs block size
> to be >= LBA size.  That is, you can't take a 512-byte blocksize ext2
> filesystem and put it on a 4kB LBA storage device.
> 
> That means that files can only grow/shrink in 256MB increments.  I
> don't think that amount of wasted space is going to be acceptable.
> So if we're serious about going down this path, we need to tell
> filesystem people to start working out how to support fs block
> size < LBA size.
> 
> That's a big ask, so let's be sure storage vendors actually want
> this.  Both supporting zoned devices & supporting 16k/64k block
> sizes are easier asks.

What HDD vendors want is to be able to have 32k or even 64k *physical*
sector sizes.  This allows for much more efficient erasure codes, so
it will increase their byte capacity now that it's no longer easy to
get capacity boosts by squeezing the tracks closer and closer, and
there have been various engineering tradeoffs with SMR, HAMR, and
MAMR.  HDD vendors have been asking for this at LSF/MM, and in other
venues for ***years***.

This doesn't necessarily mean that the *logical* sector size needs to
be larger.  What I could imagine HDD vendors doing is creating drives
with, say, a 4k logical block size and a 32k physical sector size.
This means that 4k random writes will require read/modify/write
cycles, which isn't great from a performance perspective.  However,
for those customers who are using raw block
devices for their cluster file system, and for those customers who are
willing to, say, use ext4 with a 4k block size and a 32k cluster size
(using the bigalloc feature), all of the data blocks would be 32k
aligned, and this would work without any modifications.
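
To make the alignment argument concrete, here is a minimal sketch (not
anything from a real driver or from ext4) of when a write to a drive
with a 32k physical sector would force the drive into a
read/modify/write cycle.  The function name and the default size are
assumptions chosen purely for illustration.

```python
PHYSICAL_SECTOR = 32 * 1024  # assumed 32k physical sector size


def needs_read_modify_write(offset, length, physical=PHYSICAL_SECTOR):
    """Return True if a write only partially covers a physical sector.

    offset and length are in bytes.  A write avoids read/modify/write
    only when both its start and its end fall on physical sector
    boundaries.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    end = offset + length
    return offset % physical != 0 or end % physical != 0


# A lone 4k write lands inside a 32k sector, so the drive has to read
# the rest of the sector, merge, and write it back.
print(needs_read_modify_write(4096, 4096))        # True

# A 32k-aligned, 32k-sized write (what a 32k cluster allocation would
# produce for a full-cluster write) covers whole sectors.
print(needs_read_modify_write(2 * 32768, 32768))  # False
```

With 32k-aligned allocation, full-cluster writes fall into the second
case; smaller overwrites within a cluster would still hit the first.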

I suspect that if these drives were made available, this would allow
for a gradual transition to support larger block sizes.  The file
system level changes aren't *that* hard.  There is a chicken and egg
situation here; until these drives are generally available, the
incentive to do the work is minimal.  But with a 4k logical, 32k or
64k physical sector size, we can gradually improve our support for
file systems with block size > page size, with cluster size >
page size being an intermediate step that would work today.
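
As a rough illustration of how software could notice such a drive,
Linux already reports the two sizes separately in sysfs, as
queue/logical_block_size and queue/physical_block_size under
/sys/block/<dev>/.  The sketch below just reads them; the helper name
and the sysfs_root parameter (there so it can be pointed at a test
directory) are assumptions for illustration.

```python
from pathlib import Path


def sector_sizes(device, sysfs_root="/sys/block"):
    """Return (logical, physical) block sizes in bytes for a device.

    device is a name such as "sda".  Raises FileNotFoundError if the
    device or its queue attributes do not exist.
    """
    queue = Path(sysfs_root) / device / "queue"
    logical = int((queue / "logical_block_size").read_text())
    physical = int((queue / "physical_block_size").read_text())
    return logical, physical


def sub_physical_writes_possible(device, sysfs_root="/sys/block"):
    """True if the logical size is smaller than the physical size, so
    that writes smaller than a physical sector can be issued."""
    logical, physical = sector_sizes(device, sysfs_root)
    return logical < physical
```

A 4k logical, 32k physical drive of the kind described above would
be expected to report (4096, 32768) here.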

Cheers,

					- Ted
