From: "Javier González" <javier.gonz@samsung.com>
To: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Matthew Wilcox <willy@infradead.org>,
	Theodore Ts'o <tytso@mit.edu>, Hannes Reinecke <hare@suse.de>,
	Luis Chamberlain <mcgrof@kernel.org>,
	Keith Busch <kbusch@kernel.org>,
	Pankaj Raghav <p.raghav@samsung.com>,
	Daniel Gomez <da.gomez@samsung.com>,
	<lsf-pc@lists.linux-foundation.org>,
	<linux-fsdevel@vger.kernel.org>, <linux-mm@kvack.org>,
	<linux-block@vger.kernel.org>
Subject: Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
Date: Fri, 10 Mar 2023 08:59:28 +0100	[thread overview]
Message-ID: <20230310075928.zcuiep3f2vpxbfdo@ArmHalley.local> (raw)
In-Reply-To: <260064c68b61f4a7bc49f09499e1c107e2a28f31.camel@HansenPartnership.com>

On 09.03.2023 08:11, James Bottomley wrote:
>On Thu, 2023-03-09 at 09:04 +0100, Javier González wrote:
>> On 08.03.2023 13:13, James Bottomley wrote:
>> > On Wed, 2023-03-08 at 17:53 +0000, Matthew Wilcox wrote:
>> > > On Mon, Mar 06, 2023 at 11:12:14AM -0500, Theodore Ts'o wrote:
>> > > > What HDD vendors want is to be able to have 32k or even 64k
>> > > > *physical* sector sizes.  This allows for much more efficient
>> > > > erasure codes, so it will increase their byte capacity now that
>> > > > it's no longer easier to get capacity boosts by squeezing the
>> > > > tracks closer and closer, and there have been various
>> > > > engineering tradeoffs with SMR, HAMR, and MAMR.  HDD vendors
>> > > > have been asking for this at LSF/MM, and in other venues, for
>> > > > ***years***.
>> > >
>> > > I've been reminded by a friend who works on the drive side that a
>> > > motivation for the SSD vendors is (essentially) the size of
>> > > sector_t. Once the drive needs to support more than 2/4 billion
>> > > sectors, they need to move to 64-bit sector numbers, so the amount
>> > > of memory consumed by the FTL doubles, the CPU data cache becomes
>> > > half as effective, etc. That significantly increases the BOM for
>> > > the drive, and so they have to charge more.  With a 512-byte LBA,
>> > > that's 2TB; with a 4096-byte LBA, it's 16TB; and with a 64k
>> > > LBA, they can keep using 32-bit LBA numbers all the way up to
>> > > 256TB.
>> >
>> > I thought the FTL operated on physical sectors and the logical to
>> > physical was done as a RMW through the FTL?  In which case sector_t
>> > shouldn't matter to the SSD vendors for FTL management because they
>> > can keep the logical sector size while increasing the physical one.
>> > Obviously if physical size goes above the FS block size, the drives
>> > will behave suboptimally with RMWs, which is why 4k physical is the
>> > max currently.
>> >
>>
>> FTL designs are complex. We have ways to keep sector addresses under
>> 64 bits, but this is a common industry problem.
>>
>> The media itself does not normally operate at 4K. Page sizes can be
>> 16K, 32K, etc.
>
>Right, and we've always said if we knew what this size was we could
>make better block write decisions.  However, today if you look at what
>most NVMe devices are reporting, it's a bit sub-optimal:
>
>jejb@lingrow:/sys/block/nvme1n1/queue> cat logical_block_size
>512
>jejb@lingrow:/sys/block/nvme1n1/queue> cat physical_block_size
>512
>jejb@lingrow:/sys/block/nvme1n1/queue> cat optimal_io_size
>0
>
>If we do get Linux to support large block sizes, are we actually going
>to get better information out of the devices?

We already have this through the NVMe optimal performance parameters
(see Dan's response for details). Note that these values are already
implemented in the kernel. If I recall correctly, Bart was the one who
did this work.
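
For reference, a quick way to see what the kernel derives from these
parameters (a sketch; the device name is a placeholder, and which
fields are printed depends on the nvme-cli version):

    # preferred write granularity / alignment / optimal write size
    nvme id-ns /dev/nvme0n1 --human-readable | grep -iE 'npwg|npwa|nows'
    # what the block layer derived from them
    cat /sys/block/nvme0n1/queue/minimum_io_size
    cat /sys/block/nvme0n1/queue/optimal_io_size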

Moreover, from the vendor side, it is a challenge to expose larger LBAs
without wide support across OSs. I am confident that if we push for
this work and it fits existing FSs, we will initially see vendors
exposing new LBA formats alongside existing ones (the same way we have
512b and 4K in the same drive), and eventually focusing only on larger
LBA sizes.
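
To illustrate what several LBA formats on one drive look like in
practice (a sketch; the device name and format index are placeholders,
and reformatting destroys all data on the namespace):

    # list the LBA formats the namespace supports
    nvme id-ns /dev/nvme0n1 --human-readable | grep 'LBA Format'
    # switch to another supported format, e.g. index 1
    nvme format /dev/nvme0n1 --lbaf=1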

>
>>  Increasing the block size would allow for better host/device
>> cooperation. As Ted mentions, this has been a requirement for HDD and
>> SSD vendors for years. It seems to us that the time is right now and
>> that we have mechanisms in Linux to do the plumbing. Folios are
>> obviously a big part of this.
>
>Well, a decade ago we did a lot of work to support 4k sector devices.
>Ultimately the industry went with 512 logical/4k physical devices
>because of problems with non-Linux proprietary OSs, but you could still
>use 4k today if you wanted (I've actually still got a working 4k SCSI
>drive), so why is no NVMe device doing that?

Most NVMe devices report 4K today. The 512b format is mostly an
optimization targeted at read-heavy workloads.
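
Coming back to Willy's sector_t arithmetic above, the FTL footprint
argument works out roughly as follows (a back-of-the-envelope sketch,
assuming a flat page-level FTL with one mapping entry per logical
block; real designs vary):

    16 TB  at 4 KB blocks  = 2^32 entries; at 4 B each -> 16 GB of map
    beyond 2^32 blocks the entries grow to 8 B, doubling the table
    256 TB at 64 KB blocks = 2^32 entries, so 32-bit entries still fit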

>
>This is not to say I think larger block sizes are in any way a bad idea
>... I just think that given the history, it will be driven by
>application needs rather than what the manufacturers tell us.

I increasingly think this deserves a session at LSF/MM.


Thread overview: 67+ messages
2023-03-01  3:52 [LSF/MM/BPF TOPIC] Cloud storage optimizations Theodore Ts'o
2023-03-01  4:18 ` Gao Xiang
2023-03-01  4:40   ` Matthew Wilcox
2023-03-01  4:59     ` Gao Xiang
2023-03-01  4:35 ` Matthew Wilcox
2023-03-01  4:49   ` Gao Xiang
2023-03-01  5:01     ` Matthew Wilcox
2023-03-01  5:09       ` Gao Xiang
2023-03-01  5:19         ` Gao Xiang
2023-03-01  5:42         ` Matthew Wilcox
2023-03-01  5:51           ` Gao Xiang
2023-03-01  6:00             ` Gao Xiang
2023-03-02  3:13 ` Chaitanya Kulkarni
2023-03-02  3:50 ` Darrick J. Wong
2023-03-03  3:03   ` Martin K. Petersen
2023-03-02 20:30 ` Bart Van Assche
2023-03-03  3:05   ` Martin K. Petersen
2023-03-03  1:58 ` Keith Busch
2023-03-03  3:49   ` Matthew Wilcox
2023-03-03 11:32     ` Hannes Reinecke
2023-03-03 13:11     ` James Bottomley
2023-03-04  7:34       ` Matthew Wilcox
2023-03-04 13:41         ` James Bottomley
2023-03-04 16:39           ` Matthew Wilcox
2023-03-05  4:15             ` Luis Chamberlain
2023-03-05  5:02               ` Matthew Wilcox
2023-03-08  6:11                 ` Luis Chamberlain
2023-03-08  7:59                   ` Dave Chinner
2023-03-06 12:04               ` Hannes Reinecke
2023-03-06  3:50             ` James Bottomley
2023-03-04 19:04         ` Luis Chamberlain
2023-03-03 21:45     ` Luis Chamberlain
2023-03-03 22:07       ` Keith Busch
2023-03-03 22:14         ` Luis Chamberlain
2023-03-03 22:32           ` Keith Busch
2023-03-03 23:09             ` Luis Chamberlain
2023-03-16 15:29             ` Pankaj Raghav
2023-03-16 15:41               ` Pankaj Raghav
2023-03-03 23:51       ` Bart Van Assche
2023-03-04 11:08       ` Hannes Reinecke
2023-03-04 13:24         ` Javier González
2023-03-04 16:47         ` Matthew Wilcox
2023-03-04 17:17           ` Hannes Reinecke
2023-03-04 17:54             ` Matthew Wilcox
2023-03-04 18:53               ` Luis Chamberlain
2023-03-05  3:06               ` Damien Le Moal
2023-03-05 11:22               ` Hannes Reinecke
2023-03-06  8:23                 ` Matthew Wilcox
2023-03-06 10:05                   ` Hannes Reinecke
2023-03-06 16:12                   ` Theodore Ts'o
2023-03-08 17:53                     ` Matthew Wilcox
2023-03-08 18:13                       ` James Bottomley
2023-03-09  8:04                         ` Javier González
2023-03-09 13:11                           ` James Bottomley
2023-03-09 14:05                             ` Keith Busch
2023-03-09 15:23                             ` Martin K. Petersen
2023-03-09 20:49                               ` James Bottomley
2023-03-09 21:13                                 ` Luis Chamberlain
2023-03-09 21:28                                   ` Martin K. Petersen
2023-03-10  1:16                                     ` Dan Helmick
2023-03-10  7:59                             ` Javier González [this message]
2023-03-08 19:35                 ` Luis Chamberlain
2023-03-08 19:55                 ` Bart Van Assche
2023-03-03  2:54 ` Martin K. Petersen
2023-03-03  3:29   ` Keith Busch
2023-03-03  4:20   ` Theodore Ts'o
2023-07-16  4:09 BELINDA Goodpaster kelly
