* [LSF/MM/BPF TOPIC] Cloud storage optimizations
@ 2023-03-01  3:52 Theodore Ts'o
  2023-03-01  4:18 ` Gao Xiang
                   ` (6 more replies)
  0 siblings, 7 replies; 67+ messages in thread
From: Theodore Ts'o @ 2023-03-01  3:52 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-fsdevel, linux-mm, linux-block

Emulated block devices offered by cloud VMs can provide functionality
to guest kernels and applications that traditionally has not been
available to users of consumer-grade HDDs and SSDs.  For example,
today it’s possible to create a block device in Google’s Persistent
Disk with a 16k physical sector size, which promises that aligned 16k
writes will be atomic.  With NVMe, it is possible for a storage
device to promise this without requiring read-modify-write updates for
sub-16k writes.  All that is necessary are some changes in the block
layer so that the kernel does not inadvertently tear a write request
when splitting a bio because it is too large (perhaps because it got
merged with some other request, and then it gets split at an
inconvenient boundary).
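
To illustrate the kind of block-layer change being talked about, here is
a minimal sketch of a split-point clamp.  The "atomic boundary" limit is
an assumption for illustration only; the stock block layer does not
expose such a queue limit today, and the helper name is made up.

#include <linux/bio.h>
#include <linux/math64.h>

/*
 * Hypothetical sketch: pull a proposed bio split point back so it never
 * lands inside an aligned atomic-write unit (e.g. boundary_sectors == 32
 * for a 16k atomic unit on 512-byte sectors).
 */
static unsigned int clamp_split_to_atomic_boundary(struct bio *bio,
						   unsigned int max_sectors,
						   unsigned int boundary_sectors)
{
	sector_t end = bio->bi_iter.bi_sector + max_sectors;
	u32 tail = do_div(end, boundary_sectors); /* offset past the last boundary */

	/*
	 * tail < max_sectors means the proposed range crosses an atomic
	 * boundary; pull the split point back onto that boundary.  If the
	 * whole range fits inside one unit, leave it alone.
	 */
	if (tail && tail < max_sectors)
		max_sectors -= tail;
	return max_sectors;
}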

There are also more interesting, advanced optimizations that might be
possible.  For example, Jens had observed that passing hints that a
write is a journaling write (whether from file systems or databases)
could potentially be useful.  Unfortunately, most common storage devices
have not supported write hints, and support for write hints was ripped out
last year.  That can be easily reversed, but there are some other
interesting related subjects that are very much suited for LSF/MM.

For example, most cloud storage devices are doing read-ahead to try to
anticipate read requests from the VM.  This can interfere with the
read-ahead being done by the guest kernel.  So it would be useful to be
able to tell the cloud storage device whether or not a particular read
request stems from a read-ahead.  At the moment, as Matthew Wilcox has
pointed out, we use the read-ahead code path even for synchronous
buffered reads.  So plumbing this information so it can be passed through
multiple levels of the mm, fs, and block layers will probably be
needed.
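
As a point of reference for where such a hint would originate, below is
roughly how buffered read I/O gets tagged today (a simplified sketch
modeled on the fs/mpage.c pattern, not verbatim kernel code): the only
thing the device currently sees is REQ_RAHEAD, set when the read came in
through the readahead path rather than through read_folio().

#include <linux/bio.h>
#include <linux/blk_types.h>

/* Simplified sketch: tag reads that were issued by the readahead path. */
static void submit_read_bio(struct bio *bio, bool is_readahead)
{
	bio->bi_opf = REQ_OP_READ;
	if (is_readahead)
		bio->bi_opf |= REQ_RAHEAD; /* drivers may limit retries, deprioritize */
	submit_bio(bio);
}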

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-01  3:52 [LSF/MM/BPF TOPIC] Cloud storage optimizations Theodore Ts'o
@ 2023-03-01  4:18 ` Gao Xiang
  2023-03-01  4:40   ` Matthew Wilcox
  2023-03-01  4:35 ` Matthew Wilcox
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 67+ messages in thread
From: Gao Xiang @ 2023-03-01  4:18 UTC (permalink / raw)
  To: Theodore Ts'o, lsf-pc; +Cc: linux-fsdevel, linux-mm, linux-block



On 2023/3/1 11:52, Theodore Ts'o wrote:
> Emulated block devices offered by cloud VM’s can provide functionality
> to guest kernels and applications that traditionally have not been
> available to users of consumer-grade HDD and SSD’s.  For example,
> today it’s possible to create a block device in Google’s Persistent
> Disk with a 16k physical sector size, which promises that aligned 16k
> writes will be atomically.  With NVMe, it is possible for a storage
> device to promise this without requiring read-modify-write updates for
> sub-16k writes.  All that is necessary are some changes in the block
> layer so that the kernel does not inadvertently tear a write request
> when splitting a bio because it is too large (perhaps because it got
> merged with some other request, and then it gets split at an
> inconvenient boundary).

Yeah, most cloud vendors (including Alibaba Cloud) now use ext4 bigalloc
to avoid MySQL's doublewrite buffer.  In addition to improving
performance, this approach also minimizes unnecessary I/O traffic
between the compute and storage nodes.

Once I hacked together a COW-based in-house approach in XFS, using the
optimized always_cow mode with some tricks to avoid depending on the
storage.  But nowadays AWS and Google Cloud are all using ext4 bigalloc,
so.. ;-)

> 
> There are also more interesting, advanced optimizations that might be
> possible.  For example, Jens had observed the passing hints that
> journaling writes (either from file systems or databases) could be
> potentially useful.  Unfortunately most common storage devices have
> not supported write hints, and support for write hints were ripped out
> last year.  That can be easily reversed, but there are some other
> interesting related subjects that are very much suited for LSF/MM.
> 
> For example, most cloud storage devices are doing read-ahead to try to
> anticipate read requests from the VM.  This can interfere with the
> read-ahead being done by the guest kernel.  So being able to tell
> cloud storage device whether a particular read request is stemming
> from a read-ahead or not.  At the moment, as Matthew Wilcox has
> pointed out, we currently use the read-ahead code path for synchronous
> buffered reads.  So plumbing this information so it can passed through
> multiple levels of the mm, fs, and block layers will probably be
> needed.

That seems useful as well, yet if my understanding is correct, it's
somewhat unclear to me whether we could do more and come up with a
better form than the current REQ_RAHEAD (whose use cases and impact are
currently quite limited.)

Thanks,
Gao Xiang

> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-01  3:52 [LSF/MM/BPF TOPIC] Cloud storage optimizations Theodore Ts'o
  2023-03-01  4:18 ` Gao Xiang
@ 2023-03-01  4:35 ` Matthew Wilcox
  2023-03-01  4:49   ` Gao Xiang
  2023-03-02  3:13 ` Chaitanya Kulkarni
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 67+ messages in thread
From: Matthew Wilcox @ 2023-03-01  4:35 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: lsf-pc, linux-fsdevel, linux-mm, linux-block

On Tue, Feb 28, 2023 at 10:52:15PM -0500, Theodore Ts'o wrote:
> For example, most cloud storage devices are doing read-ahead to try to
> anticipate read requests from the VM.  This can interfere with the
> read-ahead being done by the guest kernel.  So being able to tell
> cloud storage device whether a particular read request is stemming
> from a read-ahead or not.  At the moment, as Matthew Wilcox has
> pointed out, we currently use the read-ahead code path for synchronous
> buffered reads.  So plumbing this information so it can passed through
> multiple levels of the mm, fs, and block layers will probably be
> needed.

This shouldn't be _too_ painful.  For example, the NVMe driver already
does the right thing:

        if (req->cmd_flags & (REQ_FAILFAST_DEV | REQ_RAHEAD))
                control |= NVME_RW_LR;

        if (req->cmd_flags & REQ_RAHEAD)
                dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;

(LR is Limited Retry; FREQ_PREFETCH is "Speculative read. The command
is part of a prefetch operation")

The only problem is that the readahead code doesn't tell the filesystem
whether the request is sync or async.  This should be a simple matter
of adding a new 'bool async' to the readahead_control and then setting
REQ_RAHEAD based on that, rather than on whether the request came in
through readahead() or read_folio() (eg see mpage_readahead()).
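
A rough sketch of what that could look like (the struct here is
simplified and the "async" field is the proposal, not an existing kernel
API):

#include <linux/pagemap.h>
#include <linux/blk_types.h>

/*
 * Proposed shape (sketch): readahead_control grows an async flag, and the
 * filesystem derives REQ_RAHEAD from it instead of from which entry point
 * (readahead() vs read_folio()) was used.
 */
struct readahead_control_sketch {
	struct file *file;
	struct address_space *mapping;
	bool async;		/* true if nobody is currently waiting on this I/O */
};

static blk_opf_t ra_read_opf(const struct readahead_control_sketch *rac)
{
	return REQ_OP_READ | (rac->async ? REQ_RAHEAD : 0);
}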

Another thing to fix is that SCSI doesn't do anything with the REQ_RAHEAD
flag, so I presume T10 has some work to do (maybe they could borrow the
Access Frequency field from NVMe, since that was what the drive vendors
told us they wanted; maybe they changed their minds since).

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-01  4:18 ` Gao Xiang
@ 2023-03-01  4:40   ` Matthew Wilcox
  2023-03-01  4:59     ` Gao Xiang
  0 siblings, 1 reply; 67+ messages in thread
From: Matthew Wilcox @ 2023-03-01  4:40 UTC (permalink / raw)
  To: Gao Xiang; +Cc: Theodore Ts'o, lsf-pc, linux-fsdevel, linux-mm, linux-block

On Wed, Mar 01, 2023 at 12:18:30PM +0800, Gao Xiang wrote:
> > For example, most cloud storage devices are doing read-ahead to try to
> > anticipate read requests from the VM.  This can interfere with the
> > read-ahead being done by the guest kernel.  So being able to tell
> > cloud storage device whether a particular read request is stemming
> > from a read-ahead or not.  At the moment, as Matthew Wilcox has
> > pointed out, we currently use the read-ahead code path for synchronous
> > buffered reads.  So plumbing this information so it can passed through
> > multiple levels of the mm, fs, and block layers will probably be
> > needed.
> 
> It seems that is also useful as well, yet if my understanding is correct,
> it's somewhat unclear for me if we could do more and have a better form
> compared with the current REQ_RAHEAD (currently REQ_RAHEAD use cases and
> impacts are quite limited.)

I'm pretty sure the Linux readahead algorithms could do with some serious
tuning (as opposed to the hacks the Android device vendors are doing).
Outside my current level of enthusiasm / knowledge, alas.  And it's
hard because while we no longer care about performance on floppies,
we do care about performance from CompactFlash to 8GB/s NVMe drives.
I had one person recently complain that 200Gbps ethernet was too slow
for their storage, so there's an even faster use case to care about.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-01  4:35 ` Matthew Wilcox
@ 2023-03-01  4:49   ` Gao Xiang
  2023-03-01  5:01     ` Matthew Wilcox
  0 siblings, 1 reply; 67+ messages in thread
From: Gao Xiang @ 2023-03-01  4:49 UTC (permalink / raw)
  To: Matthew Wilcox, Theodore Ts'o
  Cc: lsf-pc, linux-fsdevel, linux-mm, linux-block

Hi Matthew!

On 2023/3/1 12:35, Matthew Wilcox wrote:
> On Tue, Feb 28, 2023 at 10:52:15PM -0500, Theodore Ts'o wrote:
>> For example, most cloud storage devices are doing read-ahead to try to
>> anticipate read requests from the VM.  This can interfere with the
>> read-ahead being done by the guest kernel.  So being able to tell
>> cloud storage device whether a particular read request is stemming
>> from a read-ahead or not.  At the moment, as Matthew Wilcox has
>> pointed out, we currently use the read-ahead code path for synchronous
>> buffered reads.  So plumbing this information so it can passed through
>> multiple levels of the mm, fs, and block layers will probably be
>> needed.
> 
> This shouldn't be _too_ painful.  For example, the NVMe driver already
> does the right thing:
> 
>          if (req->cmd_flags & (REQ_FAILFAST_DEV | REQ_RAHEAD))
>                  control |= NVME_RW_LR;
> 
>          if (req->cmd_flags & REQ_RAHEAD)
>                  dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
> 
> (LR is Limited Retry; FREQ_PREFETCH is "Speculative read. The command
> is part of a prefetch operation")
> 
> The only problem is that the readahead code doesn't tell the filesystem
> whether the request is sync or async.  This should be a simple matter
> of adding a new 'bool async' to the readahead_control and then setting
> REQ_RAHEAD based on that, rather than on whether the request came in
> through readahead() or read_folio() (eg see mpage_readahead()).

Great!  In addition to that, just (somewhat) off topic: if we have a
"bool async" now, I think it will immediately have some users (such as
EROFS), since we'd like to do post-processing (such as decompression)
immediately in the same context for sync readahead (due to missing
pages), and leave it to another kworker for async readahead (I think
it's much the same for decryption and verification).

So a "bool async" would be quite useful on my side if it could be
passed to the fs side.  I'd like to raise my hand for it.
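
If that flag existed, the filesystem-side use described above might look
something like this (a sketch assuming the "bool async" field proposed
earlier in the thread; the helpers here are made up and are not actual
EROFS code):

#include <linux/slab.h>
#include <linux/workqueue.h>

struct postproc_work {
	struct work_struct work;
	void *ctx;
};

/* Made-up helper standing in for decompression/decryption/verification. */
static void do_postprocess(void *ctx)
{
}

static void postproc_workfn(struct work_struct *work)
{
	struct postproc_work *pw = container_of(work, struct postproc_work, work);

	do_postprocess(pw->ctx);
	kfree(pw);
}

/* Sketch: run post-processing inline for sync readahead, defer otherwise. */
static void finish_read(void *ctx, bool async)
{
	struct postproc_work *pw;

	if (!async) {
		/* a reader is blocked on these folios: do the work right here */
		do_postprocess(ctx);
		return;
	}

	pw = kmalloc(sizeof(*pw), GFP_NOFS);
	if (!pw) {
		do_postprocess(ctx);	/* fall back to inline on allocation failure */
		return;
	}
	pw->ctx = ctx;
	INIT_WORK(&pw->work, postproc_workfn);
	queue_work(system_unbound_wq, &pw->work);
}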

Thanks,
Gao Xiang

> 
> Another thing to fix is that SCSI doesn't do anything with the REQ_RAHEAD
> flag, so I presume T10 has some work to do (maybe they could borrow the
> Access Frequency field from NVMe, since that was what the drive vendors
> told us they wanted; maybe they changed their minds since).

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-01  4:40   ` Matthew Wilcox
@ 2023-03-01  4:59     ` Gao Xiang
  0 siblings, 0 replies; 67+ messages in thread
From: Gao Xiang @ 2023-03-01  4:59 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Theodore Ts'o, lsf-pc, linux-fsdevel, linux-mm, linux-block



On 2023/3/1 12:40, Matthew Wilcox wrote:
> On Wed, Mar 01, 2023 at 12:18:30PM +0800, Gao Xiang wrote:
>>> For example, most cloud storage devices are doing read-ahead to try to
>>> anticipate read requests from the VM.  This can interfere with the
>>> read-ahead being done by the guest kernel.  So being able to tell
>>> cloud storage device whether a particular read request is stemming
>>> from a read-ahead or not.  At the moment, as Matthew Wilcox has
>>> pointed out, we currently use the read-ahead code path for synchronous
>>> buffered reads.  So plumbing this information so it can passed through
>>> multiple levels of the mm, fs, and block layers will probably be
>>> needed.
>>
>> It seems that is also useful as well, yet if my understanding is correct,
>> it's somewhat unclear for me if we could do more and have a better form
>> compared with the current REQ_RAHEAD (currently REQ_RAHEAD use cases and
>> impacts are quite limited.)
> 
> I'm pretty sure the Linux readahead algorithms could do with some serious
> tuning (as opposed to the hacks the Android device vendors are doing).
> Outside my current level of enthusiasm / knowledge, alas.  And it's
> hard because while we no longer care about performance on floppies,
> we do care about performance from CompactFlash to 8GB/s NVMe drives.
> I had one person recently complain that 200Gbps ethernet was too slow
> for their storage, so there's a faster usecase to care about.

Yes, we might have a chance to revisit the current readahead algorithm
for modern storage devices.  I understand how the current readahead
works, but I don't have enough bandwidth to analyse the workloads and
investigate further; also, heuristics like this tend to have both pros
and cons.

As a public cloud vendor, it becomes vital to improve this, since some
users care a great deal about exactly these corner cases when comparing
us with other competitors.

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-01  4:49   ` Gao Xiang
@ 2023-03-01  5:01     ` Matthew Wilcox
  2023-03-01  5:09       ` Gao Xiang
  0 siblings, 1 reply; 67+ messages in thread
From: Matthew Wilcox @ 2023-03-01  5:01 UTC (permalink / raw)
  To: Gao Xiang; +Cc: Theodore Ts'o, lsf-pc, linux-fsdevel, linux-mm, linux-block

On Wed, Mar 01, 2023 at 12:49:10PM +0800, Gao Xiang wrote:
> > The only problem is that the readahead code doesn't tell the filesystem
> > whether the request is sync or async.  This should be a simple matter
> > of adding a new 'bool async' to the readahead_control and then setting
> > REQ_RAHEAD based on that, rather than on whether the request came in
> > through readahead() or read_folio() (eg see mpage_readahead()).
> 
> Great!  In addition to that, just (somewhat) off topic, if we have a
> "bool async" now, I think it will immediately have some users (such as
> EROFS), since we'd like to do post-processing (such as decompression)
> immediately in the same context with sync readahead (due to missing
> pages) and leave it to another kworker for async readahead (I think
> it's almost same for decryption and verification).
> 
> So "bool async" is quite useful on my side if it could be possible
> passed to fs side.  I'd like to raise my hands to have it.

That's a really interesting use-case; thanks for bringing it up.

Ideally, we'd have the waiting task do the
decompression/decryption/verification for proper accounting of CPU.
Unfortunately, if the folio isn't uptodate, the task doesn't even hold
a reference to the folio while it waits, so there's no way to wake the
task and let it know that it has work to do.  At least not at the moment
... let me think about that a bit (and if you see a way to do it, feel
free to propose it).

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-01  5:01     ` Matthew Wilcox
@ 2023-03-01  5:09       ` Gao Xiang
  2023-03-01  5:19         ` Gao Xiang
  2023-03-01  5:42         ` Matthew Wilcox
  0 siblings, 2 replies; 67+ messages in thread
From: Gao Xiang @ 2023-03-01  5:09 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Theodore Ts'o, lsf-pc, linux-fsdevel, linux-mm, linux-block



On 2023/3/1 13:01, Matthew Wilcox wrote:
> On Wed, Mar 01, 2023 at 12:49:10PM +0800, Gao Xiang wrote:
>>> The only problem is that the readahead code doesn't tell the filesystem
>>> whether the request is sync or async.  This should be a simple matter
>>> of adding a new 'bool async' to the readahead_control and then setting
>>> REQ_RAHEAD based on that, rather than on whether the request came in
>>> through readahead() or read_folio() (eg see mpage_readahead()).
>>
>> Great!  In addition to that, just (somewhat) off topic, if we have a
>> "bool async" now, I think it will immediately have some users (such as
>> EROFS), since we'd like to do post-processing (such as decompression)
>> immediately in the same context with sync readahead (due to missing
>> pages) and leave it to another kworker for async readahead (I think
>> it's almost same for decryption and verification).
>>
>> So "bool async" is quite useful on my side if it could be possible
>> passed to fs side.  I'd like to raise my hands to have it.
> 
> That's a really interesting use-case; thanks for bringing it up.
> 
> Ideally, we'd have the waiting task do the
> decompression/decryption/verification for proper accounting of CPU.
> Unfortunately, if the folio isn't uptodate, the task doesn't even hold
> a reference to the folio while it waits, so there's no way to wake the
> task and let it know that it has work to do.  At least not at the moment
> ... let me think about that a bit (and if you see a way to do it, feel
> free to propose it)

Honestly, I'd like to hold the folio lock until all post-processing is
done, then mark the folio uptodate and unlock it, so that all we need to
do is pass the locked-folio requests to kworkers for the async way, or
handle them synchronously in the original context.

If we unlocked these folios in advance without marking them uptodate, we
would have to lock them again (which could mean more lock contention)
and would need a way to track folios that have completed I/O but not yet
been post-processed, in addition to those with no I/O done yet.

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-01  5:09       ` Gao Xiang
@ 2023-03-01  5:19         ` Gao Xiang
  2023-03-01  5:42         ` Matthew Wilcox
  1 sibling, 0 replies; 67+ messages in thread
From: Gao Xiang @ 2023-03-01  5:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Theodore Ts'o, lsf-pc, linux-fsdevel, linux-mm, linux-block



On 2023/3/1 13:09, Gao Xiang wrote:
> 
> 
> On 2023/3/1 13:01, Matthew Wilcox wrote:
>> On Wed, Mar 01, 2023 at 12:49:10PM +0800, Gao Xiang wrote:
>>>> The only problem is that the readahead code doesn't tell the filesystem
>>>> whether the request is sync or async.  This should be a simple matter
>>>> of adding a new 'bool async' to the readahead_control and then setting
>>>> REQ_RAHEAD based on that, rather than on whether the request came in
>>>> through readahead() or read_folio() (eg see mpage_readahead()).
>>>
>>> Great!  In addition to that, just (somewhat) off topic, if we have a
>>> "bool async" now, I think it will immediately have some users (such as
>>> EROFS), since we'd like to do post-processing (such as decompression)
>>> immediately in the same context with sync readahead (due to missing
>>> pages) and leave it to another kworker for async readahead (I think
>>> it's almost same for decryption and verification).
>>>
>>> So "bool async" is quite useful on my side if it could be possible
>>> passed to fs side.  I'd like to raise my hands to have it.
>>
>> That's a really interesting use-case; thanks for bringing it up.
>>
>> Ideally, we'd have the waiting task do the
>> decompression/decryption/verification for proper accounting of CPU.
>> Unfortunately, if the folio isn't uptodate, the task doesn't even hold
>> a reference to the folio while it waits, so there's no way to wake the
>> task and let it know that it has work to do.  At least not at the moment
>> ... let me think about that a bit (and if you see a way to do it, feel
>> free to propose it)
> 
> Honestly, I'd like to take the folio lock until all post-processing is
> done and make it uptodate and unlock so that only we need is to pass
> locked-folios requests to kworkers for async way or sync handling in
> the original context.
> 
> If we unlocked these folios in advance without uptodate, which means
> we have to lock it again (which could have more lock contention) and
> need to have a way to trace I/Oed but not post-processed stuff in
> addition to no I/Oed stuff.

I'm not sure which way is better for proper accounting of CPU, but an
individual fs can know more than the mm about its post-processing, so
perhaps we just need some accounting APIs that fses can call for this.

Currently I think core MM just needs to export the "async" bool in the
rac.  EROFS right now only does sync decompression for <= 4 pages in
z_erofs_readahead(), and I think that can be done better, see:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/erofs/zdata.c?h=v6.2#n832
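
For context, the policy being referred to is roughly the following (a
simplified paraphrase, not the actual z_erofs_readahead() code; the
threshold and helper name are illustrative):

/* Decompress synchronously only for small readahead batches, where the
 * caller likely needs the data right away; otherwise defer to a worker. */
#define SYNC_DECOMPRESS_MAX_PAGES	4

static bool want_sync_decompress(unsigned int nr_pages)
{
	return nr_pages <= SYNC_DECOMPRESS_MAX_PAGES;
}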

Thanks,
Gao Xiang

> 
> Thanks,
> Gao Xiang

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-01  5:09       ` Gao Xiang
  2023-03-01  5:19         ` Gao Xiang
@ 2023-03-01  5:42         ` Matthew Wilcox
  2023-03-01  5:51           ` Gao Xiang
  1 sibling, 1 reply; 67+ messages in thread
From: Matthew Wilcox @ 2023-03-01  5:42 UTC (permalink / raw)
  To: Gao Xiang; +Cc: Theodore Ts'o, lsf-pc, linux-fsdevel, linux-mm, linux-block

On Wed, Mar 01, 2023 at 01:09:34PM +0800, Gao Xiang wrote:
> On 2023/3/1 13:01, Matthew Wilcox wrote:
> > On Wed, Mar 01, 2023 at 12:49:10PM +0800, Gao Xiang wrote:
> > > > The only problem is that the readahead code doesn't tell the filesystem
> > > > whether the request is sync or async.  This should be a simple matter
> > > > of adding a new 'bool async' to the readahead_control and then setting
> > > > REQ_RAHEAD based on that, rather than on whether the request came in
> > > > through readahead() or read_folio() (eg see mpage_readahead()).
> > > 
> > > Great!  In addition to that, just (somewhat) off topic, if we have a
> > > "bool async" now, I think it will immediately have some users (such as
> > > EROFS), since we'd like to do post-processing (such as decompression)
> > > immediately in the same context with sync readahead (due to missing
> > > pages) and leave it to another kworker for async readahead (I think
> > > it's almost same for decryption and verification).
> > > 
> > > So "bool async" is quite useful on my side if it could be possible
> > > passed to fs side.  I'd like to raise my hands to have it.
> > 
> > That's a really interesting use-case; thanks for bringing it up.
> > 
> > Ideally, we'd have the waiting task do the
> > decompression/decryption/verification for proper accounting of CPU.
> > Unfortunately, if the folio isn't uptodate, the task doesn't even hold
> > a reference to the folio while it waits, so there's no way to wake the
> > task and let it know that it has work to do.  At least not at the moment
> > ... let me think about that a bit (and if you see a way to do it, feel
> > free to propose it)
> 
> Honestly, I'd like to take the folio lock until all post-processing is
> done and make it uptodate and unlock so that only we need is to pass
> locked-folios requests to kworkers for async way or sync handling in
> the original context.
> 
> If we unlocked these folios in advance without uptodate, which means
> we have to lock it again (which could have more lock contention) and
> need to have a way to trace I/Oed but not post-processed stuff in
> addition to no I/Oed stuff.

Right, look at how it's handled right now ...

sys_read() ends up in filemap_get_pages() which (assuming no folio in
cache) calls page_cache_sync_readahead().  That creates locked, !uptodate
folios and asks the filesystem to fill them.  Unless that completes
incredibly quickly, filemap_get_pages() ends up in filemap_update_page()
which calls folio_put_wait_locked().

If the filesystem BIO completion routine could identify if there was
a task waiting and select one of them, it could wake up the waiter and
pass it a description of what work it needed to do (with the folio still
locked), rather than do the postprocessing itself and unlock the folio.

But that all seems _very_ hard to do with 100% reliability.  Note the
comment in folio_wait_bit_common() which points out that the waiters
bit may be set even when there are no waiters.  The wake_up code
doesn't seem to support this kind of thing (all waiters are
non-exclusive, but only wake up one of them).


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-01  5:42         ` Matthew Wilcox
@ 2023-03-01  5:51           ` Gao Xiang
  2023-03-01  6:00             ` Gao Xiang
  0 siblings, 1 reply; 67+ messages in thread
From: Gao Xiang @ 2023-03-01  5:51 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Theodore Ts'o, lsf-pc, linux-fsdevel, linux-mm, linux-block



On 2023/3/1 13:42, Matthew Wilcox wrote:
> On Wed, Mar 01, 2023 at 01:09:34PM +0800, Gao Xiang wrote:
>> On 2023/3/1 13:01, Matthew Wilcox wrote:
>>> On Wed, Mar 01, 2023 at 12:49:10PM +0800, Gao Xiang wrote:
>>>>> The only problem is that the readahead code doesn't tell the filesystem
>>>>> whether the request is sync or async.  This should be a simple matter
>>>>> of adding a new 'bool async' to the readahead_control and then setting
>>>>> REQ_RAHEAD based on that, rather than on whether the request came in
>>>>> through readahead() or read_folio() (eg see mpage_readahead()).
>>>>
>>>> Great!  In addition to that, just (somewhat) off topic, if we have a
>>>> "bool async" now, I think it will immediately have some users (such as
>>>> EROFS), since we'd like to do post-processing (such as decompression)
>>>> immediately in the same context with sync readahead (due to missing
>>>> pages) and leave it to another kworker for async readahead (I think
>>>> it's almost same for decryption and verification).
>>>>
>>>> So "bool async" is quite useful on my side if it could be possible
>>>> passed to fs side.  I'd like to raise my hands to have it.
>>>
>>> That's a really interesting use-case; thanks for bringing it up.
>>>
>>> Ideally, we'd have the waiting task do the
>>> decompression/decryption/verification for proper accounting of CPU.
>>> Unfortunately, if the folio isn't uptodate, the task doesn't even hold
>>> a reference to the folio while it waits, so there's no way to wake the
>>> task and let it know that it has work to do.  At least not at the moment
>>> ... let me think about that a bit (and if you see a way to do it, feel
>>> free to propose it)
>>
>> Honestly, I'd like to take the folio lock until all post-processing is
>> done and make it uptodate and unlock so that only we need is to pass
>> locked-folios requests to kworkers for async way or sync handling in
>> the original context.
>>
>> If we unlocked these folios in advance without uptodate, which means
>> we have to lock it again (which could have more lock contention) and
>> need to have a way to trace I/Oed but not post-processed stuff in
>> addition to no I/Oed stuff.
> 
> Right, look at how it's handled right now ...
> 
> sys_read() ends up in filemap_get_pages() which (assuming no folio in
> cache) calls page_cache_sync_readahead().  That creates locked, !uptodate
> folios and asks the filesystem to fill them.  Unless that completes
> incredibly quickly, filemap_get_pages() ends up in filemap_update_page()
> which calls folio_put_wait_locked().
> 
> If the filesystem BIO completion routine could identify if there was
> a task waiting and select one of them, it could wake up the waiter and
> pass it a description of what work it needed to do (with the folio still
> locked), rather than do the postprocessing itself and unlock the folio

Currently, EROFS sync decompression waits in .readahead() with the page
cache folios locked, using a "completion" kept alongside the BIO
descriptor (bi_private) in the original context, so that the filesystem
BIO completion handler just needs to complete that completion and wake
up the original context (which will need the page data immediately
anyway, since pages were missing); the original context then carries on
in .readahead() and unlocks the folios.

Does this approach have some flaw?  Or am I missing something?
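
In outline, the completion-based pattern being described is the standard
one (a simplified sketch, not the actual EROFS code):

#include <linux/bio.h>
#include <linux/completion.h>

/* The submitting context keeps the folios locked, waits on a completion
 * stashed in bi_private, and does the decompression itself afterwards. */
static void sync_read_end_io(struct bio *bio)
{
	complete(bio->bi_private);	/* wake the waiting .readahead() context */
	bio_put(bio);
}

static void read_then_postprocess(struct bio *bio)
{
	DECLARE_COMPLETION_ONSTACK(done);

	bio->bi_private = &done;
	bio->bi_end_io = sync_read_end_io;
	submit_bio(bio);
	wait_for_completion(&done);	/* folios are still locked here */

	/* decompress/verify, then mark the folios uptodate and unlock them */
}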

Thanks,
Gao Xiang

> 
> But that all seems _very_ hard to do with 100% reliability.  Note the
> comment in folio_wait_bit_common() which points out that the waiters
> bit may be set even when there are no waiters.  The wake_up code
> doesn't seem to support this kind of thing (all waiters are
> non-exclusive, but only wake up one of them).

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-01  5:51           ` Gao Xiang
@ 2023-03-01  6:00             ` Gao Xiang
  0 siblings, 0 replies; 67+ messages in thread
From: Gao Xiang @ 2023-03-01  6:00 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Theodore Ts'o, lsf-pc, linux-fsdevel, linux-mm, linux-block



On 2023/3/1 13:51, Gao Xiang wrote:
> 
> 
> On 2023/3/1 13:42, Matthew Wilcox wrote:
>> On Wed, Mar 01, 2023 at 01:09:34PM +0800, Gao Xiang wrote:
>>> On 2023/3/1 13:01, Matthew Wilcox wrote:
>>>> On Wed, Mar 01, 2023 at 12:49:10PM +0800, Gao Xiang wrote:
>>>>>> The only problem is that the readahead code doesn't tell the filesystem
>>>>>> whether the request is sync or async.  This should be a simple matter
>>>>>> of adding a new 'bool async' to the readahead_control and then setting
>>>>>> REQ_RAHEAD based on that, rather than on whether the request came in
>>>>>> through readahead() or read_folio() (eg see mpage_readahead()).
>>>>>
>>>>> Great!  In addition to that, just (somewhat) off topic, if we have a
>>>>> "bool async" now, I think it will immediately have some users (such as
>>>>> EROFS), since we'd like to do post-processing (such as decompression)
>>>>> immediately in the same context with sync readahead (due to missing
>>>>> pages) and leave it to another kworker for async readahead (I think
>>>>> it's almost same for decryption and verification).
>>>>>
>>>>> So "bool async" is quite useful on my side if it could be possible
>>>>> passed to fs side.  I'd like to raise my hands to have it.
>>>>
>>>> That's a really interesting use-case; thanks for bringing it up.
>>>>
>>>> Ideally, we'd have the waiting task do the
>>>> decompression/decryption/verification for proper accounting of CPU.
>>>> Unfortunately, if the folio isn't uptodate, the task doesn't even hold
>>>> a reference to the folio while it waits, so there's no way to wake the
>>>> task and let it know that it has work to do.  At least not at the moment
>>>> ... let me think about that a bit (and if you see a way to do it, feel
>>>> free to propose it)
>>>
>>> Honestly, I'd like to take the folio lock until all post-processing is
>>> done and make it uptodate and unlock so that only we need is to pass
>>> locked-folios requests to kworkers for async way or sync handling in
>>> the original context.
>>>
>>> If we unlocked these folios in advance without uptodate, which means
>>> we have to lock it again (which could have more lock contention) and
>>> need to have a way to trace I/Oed but not post-processed stuff in
>>> addition to no I/Oed stuff.
>>
>> Right, look at how it's handled right now ...
>>
>> sys_read() ends up in filemap_get_pages() which (assuming no folio in
>> cache) calls page_cache_sync_readahead().  That creates locked, !uptodate
>> folios and asks the filesystem to fill them.  Unless that completes
>> incredibly quickly, filemap_get_pages() ends up in filemap_update_page()
>> which calls folio_put_wait_locked().
>>
>> If the filesystem BIO completion routine could identify if there was
>> a task waiting and select one of them, it could wake up the waiter and
>> pass it a description of what work it needed to do (with the folio still
>> locked), rather than do the postprocessing itself and unlock the folio
> 
> Currently EROFS sync decompression is waiting in .readahead() with locked
> page cache folios, one "completion" together than BIO descriptor
> (bi_private) in the original context, so that the filesystem BIO completion
> just needs to complete the completion and wakeup the original context
> (due to missing pages, so the original context will need the page data
> immediately as well) to go on .readhead() and unlock folios.
> 
> Does this way have some flew? Or I'm missing something?

In this way, EROFS sync decompression is all handled with a completion
in .readahead(), and the folios are marked uptodate and unlocked before
leaving .readahead(), so the folio_test_uptodate() check will (almost)
always succeed before filemap_update_page() is ever reached:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/erofs/zdata.c?h=v6.2#n1167

So I think core MM just needs to export a "bool async" for fses...

Thanks,
Gao Xiang

> 
> Thanks,
> Gao Xiang
> 
>>
>> But that all seems _very_ hard to do with 100% reliability.  Note the
>> comment in folio_wait_bit_common() which points out that the waiters
>> bit may be set even when there are no waiters.  The wake_up code
>> doesn't seem to support this kind of thing (all waiters are
>> non-exclusive, but only wake up one of them).

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-01  3:52 [LSF/MM/BPF TOPIC] Cloud storage optimizations Theodore Ts'o
  2023-03-01  4:18 ` Gao Xiang
  2023-03-01  4:35 ` Matthew Wilcox
@ 2023-03-02  3:13 ` Chaitanya Kulkarni
  2023-03-02  3:50 ` Darrick J. Wong
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 67+ messages in thread
From: Chaitanya Kulkarni @ 2023-03-02  3:13 UTC (permalink / raw)
  To: Theodore Ts'o, lsf-pc
  Cc: linux-fsdevel, linux-mm, linux-block, linux-nvme

(+linux-nvme)

On 2/28/2023 7:52 PM, Theodore Ts'o wrote:
> Emulated block devices offered by cloud VM’s can provide functionality
> to guest kernels and applications that traditionally have not been
> available to users of consumer-grade HDD and SSD’s.  For example,
> today it’s possible to create a block device in Google’s Persistent
> Disk with a 16k physical sector size, which promises that aligned 16k
> writes will be atomically.  With NVMe, it is possible for a storage
> device to promise this without requiring read-modify-write updates for
> sub-16k writes.  All that is necessary are some changes in the block
> layer so that the kernel does not inadvertently tear a write request
> when splitting a bio because it is too large (perhaps because it got
> merged with some other request, and then it gets split at an
> inconvenient boundary).
> 
> There are also more interesting, advanced optimizations that might be
> possible.  For example, Jens had observed the passing hints that
> journaling writes (either from file systems or databases) could be
> potentially useful.  Unfortunately most common storage devices have
> not supported write hints, and support for write hints were ripped out
> last year.  That can be easily reversed, but there are some other
> interesting related subjects that are very much suited for LSF/MM.
> 
> For example, most cloud storage devices are doing read-ahead to try to
> anticipate read requests from the VM.  This can interfere with the
> read-ahead being done by the guest kernel.  So being able to tell
> cloud storage device whether a particular read request is stemming
> from a read-ahead or not.  At the moment, as Matthew Wilcox has
> pointed out, we currently use the read-ahead code path for synchronous
> buffered reads.  So plumbing this information so it can passed through
> multiple levels of the mm, fs, and block layers will probably be
> needed.
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-01  3:52 [LSF/MM/BPF TOPIC] Cloud storage optimizations Theodore Ts'o
                   ` (2 preceding siblings ...)
  2023-03-02  3:13 ` Chaitanya Kulkarni
@ 2023-03-02  3:50 ` Darrick J. Wong
  2023-03-03  3:03   ` Martin K. Petersen
  2023-03-02 20:30 ` Bart Van Assche
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 67+ messages in thread
From: Darrick J. Wong @ 2023-03-02  3:50 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: lsf-pc, linux-fsdevel, linux-mm, linux-block

On Tue, Feb 28, 2023 at 10:52:15PM -0500, Theodore Ts'o wrote:
> Emulated block devices offered by cloud VM’s can provide functionality
> to guest kernels and applications that traditionally have not been
> available to users of consumer-grade HDD and SSD’s.  For example,
> today it’s possible to create a block device in Google’s Persistent
> Disk with a 16k physical sector size, which promises that aligned 16k
> writes will be atomically.  With NVMe, it is possible for a storage
> device to promise this without requiring read-modify-write updates for
> sub-16k writes.  All that is necessary are some changes in the block
> layer so that the kernel does not inadvertently tear a write request
> when splitting a bio because it is too large (perhaps because it got
> merged with some other request, and then it gets split at an
> inconvenient boundary).

Now that we've flung ourselves into the wild world of Software Defined
Secure Storage as a Service*, I was thinking --

T10 PI gives the kernel a means to associate its own checksums (and a
goofy u16 tag) with LBAs on disk.  There haven't been that many actual
SCSI devices that implement it, but I wonder how hard it would be for
cloud storage backends to export things like that?  The storage nodes
often have a bit more CPU power, too.
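
For reference, the 8-byte protection-information tuple that classic T10
PI attaches to each logical block looks like this (as defined in the
kernel's include/linux/t10-pi.h); the app_tag is the "goofy u16"
mentioned above:

#include <linux/types.h>

struct t10_pi_tuple {
	__be16 guard_tag;	/* CRC of the data block */
	__be16 app_tag;		/* opaque; available to the application/filesystem */
	__be32 ref_tag;		/* typically the low 32 bits of the target LBA */
};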

Though admittedly the advent of customer-managed FDE in the cloud
might make that less useful?

Just my random 2c late at night,

--D

* SDSSAAS: what you get from banging head on keyboard in frustration

> There are also more interesting, advanced optimizations that might be
> possible.  For example, Jens had observed the passing hints that
> journaling writes (either from file systems or databases) could be
> potentially useful.  Unfortunately most common storage devices have
> not supported write hints, and support for write hints were ripped out
> last year.  That can be easily reversed, but there are some other
> interesting related subjects that are very much suited for LSF/MM.
> 
> For example, most cloud storage devices are doing read-ahead to try to
> anticipate read requests from the VM.  This can interfere with the
> read-ahead being done by the guest kernel.  So being able to tell
> cloud storage device whether a particular read request is stemming
> from a read-ahead or not.  At the moment, as Matthew Wilcox has
> pointed out, we currently use the read-ahead code path for synchronous
> buffered reads.  So plumbing this information so it can passed through
> multiple levels of the mm, fs, and block layers will probably be
> needed.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-01  3:52 [LSF/MM/BPF TOPIC] Cloud storage optimizations Theodore Ts'o
                   ` (3 preceding siblings ...)
  2023-03-02  3:50 ` Darrick J. Wong
@ 2023-03-02 20:30 ` Bart Van Assche
  2023-03-03  3:05   ` Martin K. Petersen
  2023-03-03  1:58 ` Keith Busch
  2023-03-03  2:54 ` Martin K. Petersen
  6 siblings, 1 reply; 67+ messages in thread
From: Bart Van Assche @ 2023-03-02 20:30 UTC (permalink / raw)
  To: Theodore Ts'o, lsf-pc; +Cc: linux-fsdevel, linux-mm, linux-block

On 2/28/23 19:52, Theodore Ts'o wrote:
> Unfortunately most common storage devices have
> not supported write hints, and support for write hints were ripped out
> last year.

Work is ongoing in T10 to add write hint support to SBC. We plan to
propose restoring write hint support once there is agreement in T10
about the approach. See also "Constrained SBC-5 Streams"
(http://www.t10.org/cgi-bin/ac.pl?t=d&f=23-024r0.pdf). This proposal was
uploaded yesterday.

Bart.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-01  3:52 [LSF/MM/BPF TOPIC] Cloud storage optimizations Theodore Ts'o
                   ` (4 preceding siblings ...)
  2023-03-02 20:30 ` Bart Van Assche
@ 2023-03-03  1:58 ` Keith Busch
  2023-03-03  3:49   ` Matthew Wilcox
  2023-03-03  2:54 ` Martin K. Petersen
  6 siblings, 1 reply; 67+ messages in thread
From: Keith Busch @ 2023-03-03  1:58 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: lsf-pc, linux-fsdevel, linux-mm, linux-block

On Tue, Feb 28, 2023 at 10:52:15PM -0500, Theodore Ts'o wrote:
> Emulated block devices offered by cloud VM’s can provide functionality
> to guest kernels and applications that traditionally have not been
> available to users of consumer-grade HDD and SSD’s.  For example,
> today it’s possible to create a block device in Google’s Persistent
> Disk with a 16k physical sector size, which promises that aligned 16k
> writes will be atomically.  With NVMe, it is possible for a storage
> device to promise this without requiring read-modify-write updates for
> sub-16k writes. 

I'm not sure it does. The NVMe spec doesn't say AWUN writes are never a RMW
operation. NVMe suggests aligning to NPWA as the best way to avoid RMW, but
doesn't guarantee that, nor does it require that this limit align to atomic
boundaries. NVMe provides a lot of hints, but stops short of promises. Vendors
can promise whatever they want, but that's outside the spec.
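
For anyone wanting to look at the fields in question, they come from the
Identify Namespace / Identify Controller data.  A minimal sketch of
reading them (field names as in the kernel's struct nvme_id_ns /
nvme_id_ctrl; note this computes the advertised atomic write unit, which,
per the point above, is a limit rather than a promise about sub-unit or
unaligned writes):

#include <linux/nvme.h>

static unsigned int atomic_write_unit_blocks(const struct nvme_id_ns *id,
					     const struct nvme_id_ctrl *ctrl)
{
	/* NAWUPF/AWUPF are 0's-based block counts; the namespace value,
	 * when reported, overrides the controller-wide one. */
	if ((id->nsfeat & NVME_NS_FEAT_ATOMICS) && id->nawupf)
		return le16_to_cpu(id->nawupf) + 1;
	return le16_to_cpu(ctrl->awupf) + 1;
}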

> All that is necessary are some changes in the block
> layer so that the kernel does not inadvertently tear a write request
> when splitting a bio because it is too large (perhaps because it got
> merged with some other request, and then it gets split at an
> inconvenient boundary).

All the limits needed to optimally split on physical boundaries exist, so I
hope we're using them correctly via get_max_io_size().

That said, I was hoping you were going to suggest supporting 16k logical block
sizes. Not a problem on some arch's, but still problematic when PAGE_SIZE is
4k. :)

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-01  3:52 [LSF/MM/BPF TOPIC] Cloud storage optimizations Theodore Ts'o
                   ` (5 preceding siblings ...)
  2023-03-03  1:58 ` Keith Busch
@ 2023-03-03  2:54 ` Martin K. Petersen
  2023-03-03  3:29   ` Keith Busch
  2023-03-03  4:20   ` Theodore Ts'o
  6 siblings, 2 replies; 67+ messages in thread
From: Martin K. Petersen @ 2023-03-03  2:54 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: lsf-pc, linux-fsdevel, linux-mm, linux-block


Hi Ted!

> With NVMe, it is possible for a storage device to promise this without
> requiring read-modify-write updates for sub-16k writes.  All that is
> necessary are some changes in the block layer so that the kernel does
> not inadvertently tear a write request when splitting a bio because it
> is too large (perhaps because it got merged with some other request,
> and then it gets split at an inconvenient boundary).

We have been working on support for atomic writes and it is not as simple
as it sounds. Atomic operations in SCSI and NVMe have semantic
differences which are challenging to reconcile. On top of that, both the
SCSI and NVMe specs are buggy in the atomics department. We are working
to get things fixed in both standards and aim to discuss our
implementation at LSF/MM.

> There are also more interesting, advanced optimizations that might be
> possible.  For example, Jens had observed the passing hints that
> journaling writes (either from file systems or databases) could be
> potentially useful.

Yep. We got very impressive results identifying journal writes and the
kernel implementation was completely trivial, but...

> Unfortunately most common storage devices have not supported write
> hints, and support for write hints were ripped out last year.  That
> can be easily reversed, but there are some other interesting related
> subjects that are very much suited for LSF/MM.

Hinting didn't see widespread adoption because we in Linux, as well as
the various interested databases, preferred hints to be per-I/O
properties, whereas $OTHER_OS insisted that hints should be statically
assigned to LBA ranges on media. This left vendors having to choose
between two very different approaches, and consequently they chose not
to support either of them.

However, hints are coming back in various forms for non-enterprise and
cloud storage devices so it's good to revive this discussion.
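
To make the two models concrete: in the per-I/O model the hint travels
with each request.  A sketch of that shape (bi_write_hint is the
since-removed field being alluded to, shown with the enum rw_hint
lifetime values that still exist in include/linux/fs.h; this is not
current mainline code):

#include <linux/bio.h>
#include <linux/fs.h>

static void submit_journal_write(struct bio *bio)
{
	bio->bi_opf = REQ_OP_WRITE | REQ_SYNC;
	bio->bi_write_hint = WRITE_LIFE_SHORT; /* journal blocks are soon overwritten */
	submit_bio(bio);
}

In the LBA-range model, by contrast, the hint would be bound to a region
of the device up front (e.g. "LBAs X..Y hold short-lived data") rather
than to individual writes.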

> For example, most cloud storage devices are doing read-ahead to try to
> anticipate read requests from the VM.  This can interfere with the
> read-ahead being done by the guest kernel.  So being able to tell
> cloud storage device whether a particular read request is stemming
> from a read-ahead or not.

Indeed. In our experience the hints that work best are the ones which
convey to the storage device why the I/O is being performed.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-02  3:50 ` Darrick J. Wong
@ 2023-03-03  3:03   ` Martin K. Petersen
  0 siblings, 0 replies; 67+ messages in thread
From: Martin K. Petersen @ 2023-03-03  3:03 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Theodore Ts'o, lsf-pc, linux-fsdevel, linux-mm, linux-block


Darrick,

> T10 PI gives the kernel a means to associate its own checksums (and a
> goofy u16 tag) with LBAs on disk.  There haven't been that many actual
> SCSI devices that implement it,

Storage arrays have traditionally put their own internal magic in that
tag space and therefore did not allow filesystems to use it.

That has changed with the latest NVMe PI amendments which allow a larger
tag (and CRC). The tag space can be split between storage and
application/filesystem use. There are definitely interesting things that
can be done in this area.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-02 20:30 ` Bart Van Assche
@ 2023-03-03  3:05   ` Martin K. Petersen
  0 siblings, 0 replies; 67+ messages in thread
From: Martin K. Petersen @ 2023-03-03  3:05 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Theodore Ts'o, lsf-pc, linux-fsdevel, linux-mm, linux-block


Bart,

> Work is ongoing in T10 to add write hint support to SBC. We plan to
> propose to restore write hint support after there is agreement in T10
> about the approach. See also "Constrained SBC-5 Streams"
> (http://www.t10.org/cgi-bin/ac.pl?t=d&f=23-024r0.pdf). This proposal
> has been uploaded yesterday.

Why have the streams dependency?

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-03  2:54 ` Martin K. Petersen
@ 2023-03-03  3:29   ` Keith Busch
  2023-03-03  4:20   ` Theodore Ts'o
  1 sibling, 0 replies; 67+ messages in thread
From: Keith Busch @ 2023-03-03  3:29 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Theodore Ts'o, lsf-pc, linux-fsdevel, linux-mm, linux-block

On Thu, Mar 02, 2023 at 09:54:59PM -0500, Martin K. Petersen wrote:
> > For example, most cloud storage devices are doing read-ahead to try to
> > anticipate read requests from the VM.  This can interfere with the
> > read-ahead being done by the guest kernel.  So being able to tell
> > cloud storage device whether a particular read request is stemming
> > from a read-ahead or not.
> 
> Indeed. In our experience the hints that work best are the ones which
> convey to the storage device why the I/O is being performed.

This may be a pretty far-out-there idea, but I think SSD BPF injection has
a potentially higher payoff than mere hints. The paper below does it in the
kernel, but imagine doing it on the device!

  https://dl.acm.org/doi/pdf/10.1145/3458336.3465290

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-03  1:58 ` Keith Busch
@ 2023-03-03  3:49   ` Matthew Wilcox
  2023-03-03 11:32     ` Hannes Reinecke
                       ` (2 more replies)
  0 siblings, 3 replies; 67+ messages in thread
From: Matthew Wilcox @ 2023-03-03  3:49 UTC (permalink / raw)
  To: Keith Busch
  Cc: Luis Chamberlain, Theodore Ts'o, lsf-pc, linux-fsdevel,
	linux-mm, linux-block

On Thu, Mar 02, 2023 at 06:58:58PM -0700, Keith Busch wrote:
> That said, I was hoping you were going to suggest supporting 16k logical block
> sizes. Not a problem on some arch's, but still problematic when PAGE_SIZE is
> 4k. :)

I was hoping Luis was going to propose a session on LBA size > PAGE_SIZE.
Funnily, while the pressure is coming from the storage vendors, I don't
think there's any work to be done in the storage layers.  It's purely
a FS+MM problem.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-03  2:54 ` Martin K. Petersen
  2023-03-03  3:29   ` Keith Busch
@ 2023-03-03  4:20   ` Theodore Ts'o
  1 sibling, 0 replies; 67+ messages in thread
From: Theodore Ts'o @ 2023-03-03  4:20 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: lsf-pc, linux-fsdevel, linux-mm, linux-block

On Thu, Mar 02, 2023 at 09:54:59PM -0500, Martin K. Petersen wrote:
> 
> Hi Ted!
> 
> > With NVMe, it is possible for a storage device to promise this without
> > requiring read-modify-write updates for sub-16k writes.  All that is
> > necessary are some changes in the block layer so that the kernel does
> > not inadvertently tear a write request when splitting a bio because it
> > is too large (perhaps because it got merged with some other request,
> > and then it gets split at an inconvenient boundary).
> 
> We have been working on support for atomic writes and it is not a simple
> as it sounds. Atomic operations in SCSI and NVMe have semantic
> differences which are challenging to reconcile. On top of that, both the
> SCSI and NVMe specs are buggy in the atomics department. We are working
> to get things fixed in both standards and aim to discuss our
> implementation at LSF/MM.

I'd be very interested to learn more about what you've found.  I know
more than one cloud provider is thinking about how to use the NVMe
spec to send information about how their emulated block devices work.
This has come up at our weekly ext4 video conference, and given that I
gave a talk about it in 2018[1], there's quite a lot of similarity in
what folks are thinking about.  Basically, MySQL and Postgres use 16k
database pages, and if they can depend on their Cloud Block Devices
Working A Certain Way and thereby avoid their special doublewrite
techniques for preventing torn writes, it can make for very noticeable
performance improvements.

[1] https://www.youtube.com/watch?v=gIeuiGg-_iw

So while the standards might allow standards-compliant physical
devices to do some really weird sh*t, it might be that if all cloud
vendors do things in the same way, I could see various cloud workloads
starting to depend on extra-standard behaviour, much like a lot of
sysadmins assume that low-numbered LBAs are on the outer diameter of
the HDD and are much more performant than sectors on the inner
diameter.  This is completely not guaranteed by the standard specs, but
it has become a de facto standard.

That's not a great place to be, and it would be great if we could find ways
that are much more reliable in terms of querying a standards-compliant
storage device and knowing whether we can depend on a certain behavior
--- but I also know how slowly storage standards bodies move.  :-(

> Hinting didn't see widespread adoption because we in Linux, as well as
> the various interested databases, preferred hints to be per-I/O
> properties. Whereas $OTHER_OS insisted that hints should be statically
> assigned to LBA ranges on media. This left vendors having to choose
> between two very different approaches and consequently they chose not to
> support any of them.

I wasn't aware of that history.  Thanks for filling that bit in.

Fortunately, in 2023, it appears that for many cloud vendors, the
teams involved care a lot more about Linux than $OTHER_OS.  So
hopefully we'll have a lot more success in getting write hints
generally available to hyperscale cloud customers.

From an industry-wide perspective, it would be useful if the write
hints used by Hyperscale Cloud Vendor #1 are very similar to what
write hints are supported by Hyperscale Cloud Vendor #2.  Standards
committees aren't the only way that companies can collaborate in an
anti-trust compliant way.  Open source is another way; and especially
if we can show that a set of hints works well for the Linux kernel and
Linux applications --- then what we ship in the Linux kernel can help
shape the set of "write hints" that cloud storage devices will
support.

					- Ted

P.S.  From a LSF/MM program perspective, I suspect we may want to have
more than one session; one that is focused on standards and atomic
writes, and another that is focused on write hints.  The first might
be mostly block and fs focused, and the second would probably be of
interest to mm folks as well.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-03  3:49   ` Matthew Wilcox
@ 2023-03-03 11:32     ` Hannes Reinecke
  2023-03-03 13:11     ` James Bottomley
  2023-03-03 21:45     ` Luis Chamberlain
  2 siblings, 0 replies; 67+ messages in thread
From: Hannes Reinecke @ 2023-03-03 11:32 UTC (permalink / raw)
  To: Matthew Wilcox, Keith Busch
  Cc: Luis Chamberlain, Theodore Ts'o, lsf-pc, linux-fsdevel,
	linux-mm, linux-block

On 3/3/23 04:49, Matthew Wilcox wrote:
> On Thu, Mar 02, 2023 at 06:58:58PM -0700, Keith Busch wrote:
>> That said, I was hoping you were going to suggest supporting 16k logical block
>> sizes. Not a problem on some arch's, but still problematic when PAGE_SIZE is
>> 4k. :)
> 
> I was hoping Luis was going to propose a session on LBA size > PAGE_SIZE.
> Funnily, while the pressure is coming from the storage vendors, I don't
> think there's any work to be done in the storage layers.  It's purely
> a FS+MM problem.

Would love to have that session, though.
Luis?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Frankenstr. 146, 90461 Nürnberg
Managing Directors: I. Totev, A. Myers, A. McDonald, M. B. Moerman
(HRB 36809, AG Nürnberg)


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-03  3:49   ` Matthew Wilcox
  2023-03-03 11:32     ` Hannes Reinecke
@ 2023-03-03 13:11     ` James Bottomley
  2023-03-04  7:34       ` Matthew Wilcox
  2023-03-03 21:45     ` Luis Chamberlain
  2 siblings, 1 reply; 67+ messages in thread
From: James Bottomley @ 2023-03-03 13:11 UTC (permalink / raw)
  To: Matthew Wilcox, Keith Busch
  Cc: Luis Chamberlain, Theodore Ts'o, lsf-pc, linux-fsdevel,
	linux-mm, linux-block

On Fri, 2023-03-03 at 03:49 +0000, Matthew Wilcox wrote:
> On Thu, Mar 02, 2023 at 06:58:58PM -0700, Keith Busch wrote:
> > That said, I was hoping you were going to suggest supporting 16k
> > logical block sizes. Not a problem on some arch's, but still
> > problematic when PAGE_SIZE is 4k. :)
> 
> I was hoping Luis was going to propose a session on LBA size >
> PAGE_SIZE. Funnily, while the pressure is coming from the storage
> vendors, I don't think there's any work to be done in the storage
> layers.  It's purely a FS+MM problem.

Heh, I can do the "fools rush in" bit, especially if what we're
interested in is the minimum it would take to support this ...

The FS problem could be solved simply by saying FS block size must
equal device block size, then it becomes purely a MM issue.  The MM
issue could be solved by adding a page order attribute to struct
address_space and insisting that pagecache/filemap functions in
mm/filemap.c all have to operate on objects that are an integer
multiple of the address space order.  The base allocator is
filemap_alloc_folio, which already has an apparently always zero order
parameter (hmmm...) and it always seems to be called from sites that
have the address_space, so it could simply be modified to always
operate at the address_space order.

The above would be a bit suboptimal in that blocks are always mapped to
physically contiguous pages, but it should be enough to get the concept
working.
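
In code I'd expect the minimum viable change to be roughly this
(completely untested sketch; min_order is an invented field, it is not
something struct address_space has today):

/* Hypothetical: a per-mapping minimum folio order, set by the filesystem
 * at inode init time to match its block size (e.g. order 2 for 16k
 * blocks with 4k PAGE_SIZE). */
static inline unsigned int mapping_min_order(struct address_space *mapping)
{
	return mapping->min_order;		/* invented field */
}

static inline struct folio *
filemap_alloc_folio_at_order(struct address_space *mapping, gfp_t gfp)
{
	/* filemap_alloc_folio() exists today; callers currently pass 0 */
	return filemap_alloc_folio(gfp, mapping_min_order(mapping));
}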

James


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-03  3:49   ` Matthew Wilcox
  2023-03-03 11:32     ` Hannes Reinecke
  2023-03-03 13:11     ` James Bottomley
@ 2023-03-03 21:45     ` Luis Chamberlain
  2023-03-03 22:07       ` Keith Busch
                         ` (2 more replies)
  2 siblings, 3 replies; 67+ messages in thread
From: Luis Chamberlain @ 2023-03-03 21:45 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Keith Busch, Theodore Ts'o, Pankaj Raghav, Daniel Gomez,
	Javier González, lsf-pc, linux-fsdevel, linux-mm,
	linux-block

On Fri, Mar 03, 2023 at 03:49:29AM +0000, Matthew Wilcox wrote:
> On Thu, Mar 02, 2023 at 06:58:58PM -0700, Keith Busch wrote:
> > That said, I was hoping you were going to suggest supporting 16k logical block
> > sizes. Not a problem on some arch's, but still problematic when PAGE_SIZE is
> > 4k. :)
> 
> I was hoping Luis was going to propose a session on LBA size > PAGE_SIZE.
> Funnily, while the pressure is coming from the storage vendors, I don't
> think there's any work to be done in the storage layers.  It's purely
> a FS+MM problem.

You'd hope most of it is left to FS + MM, but I'm not sure that's
quite it yet. Initial experimentation shows that just enabling NVMe
devices with a physical & logical block size > PAGE_SIZE gets them
brought down to 512 bytes. That seems odd to say the least. Would
changing this be an issue now?

I'm gathering there is generic interest in this topic though. So one
thing we *could* do is review the lay of the land and break down what
we all think could likely be done / is needed. At the very least we'd
come out knowing the unknowns together.

I started to think about some of these things a while ago and with the
help of Willy I tried to break down some of the items I gathered from him
into community OKRs (a super informal itemization of goals and the sub-tasks
which would complete them) and started trying to take a stab at them
with our team, but obviously I think it would be great if we all just
divide & conquer here. So maybe reviewing these and extending them
as a community would be good:

https://kernelnewbies.org/KernelProjects/large-block-size

I'm recently interested in tmpfs so will be taking a stab at higher
order page size support there to see what blows up.

The other stuff like the general IOMAP conversion is pretty well known, and
I think we already have a proposed session on that. But there are also
even smaller fish to fry: *just* doing a baseline with some
filesystems at a 4 KiB block size seems in order.

Hearing filesystem developers' thoughts on support for larger block
sizes in light of a smaller PAGE_SIZE would be good, given one of the
odd situations some distributions / teams find themselves in is trying
to support larger block sizes but with difficult access to higher
PAGE_SIZE systems. Are there ways to simplify this / help us in general?
Without that it's a bit hard to muck around with some of this in terms
of long term support. This also got me thinking about ways to try to
replicate larger IO virtual devices a bit better too. While paying a cloud
provider to test this is one nice option, it'd be great if I could just do
this in house with some hacks too. For virtio-blk-pci at least, for instance,
I wondered whether using just the host page cache suffices, or would a 4K
page cache on the host significantly skew the results for, say, a 16k
emulated IO controller? How do we most effectively virtualize 16k
controllers in-house?

To help with experimenting with large IO and NVMe / virtio-blk-pci I
recently added support to instantiate tons of large IO devices to kdevops
[0]; with it, it should be easy to reproduce odd issues we may come up
with. For instance it should be possible to subsequently extend the
kdevops fstests or blktests automation support with just a few Kconfig files
to use some of these large IO devices to see what blows up.

If we are going to have this session I'd like to encourage & invite Pankaj and
Daniel who have been doing great work on reviewing all this too and can give
some feedback on some of their own findings!

[0] https://github.com/linux-kdevops/kdevops/commit/af33568445111cc114653264f6dbc8684f3b10e8

  Luis

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-03 21:45     ` Luis Chamberlain
@ 2023-03-03 22:07       ` Keith Busch
  2023-03-03 22:14         ` Luis Chamberlain
  2023-03-03 23:51       ` Bart Van Assche
  2023-03-04 11:08       ` Hannes Reinecke
  2 siblings, 1 reply; 67+ messages in thread
From: Keith Busch @ 2023-03-03 22:07 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Matthew Wilcox, Theodore Ts'o, Pankaj Raghav, Daniel Gomez,
	Javier González, lsf-pc, linux-fsdevel, linux-mm,
	linux-block

On Fri, Mar 03, 2023 at 01:45:48PM -0800, Luis Chamberlain wrote:
> 
> You'd hope most of it is left to FS + MM, but I'm not yet sure that's
> quite it yet. Initial experimentation shows just enabling > PAGE_SIZE
> physical & logical block NVMe devices gets brought down to 512 bytes.
> That seems odd to say the least. Would changing this be an issue now?

I think you're talking about removing this part:

---
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c2730b116dc68..2c528f56c2973 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1828,17 +1828,7 @@ static void nvme_update_disk_info(struct gendisk *disk,
 	unsigned short bs = 1 << ns->lba_shift;
 	u32 atomic_bs, phys_bs, io_opt = 0;
 
-	/*
-	 * The block layer can't support LBA sizes larger than the page size
-	 * yet, so catch this early and don't allow block I/O.
-	 */
-	if (ns->lba_shift > PAGE_SHIFT) {
-		capacity = 0;
-		bs = (1 << 9);
-	}
-
 	blk_integrity_unregister(disk);
-
 	atomic_bs = phys_bs = bs;
 	if (id->nabo == 0) {
 		/*
--

This is what happens today if the driver were to let the disk create with its
actual size (testing 8k LBA size on x86):

 BUG: kernel NULL pointer dereference, address: 0000000000000008
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] SMP
 CPU: 10 PID: 115 Comm: kworker/u32:2 Not tainted 6.2.0-00032-gdb7183e3c314-dirty #105
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
 Workqueue: nvme-wq nvme_scan_work
 RIP: 0010:create_empty_buffers+0x24/0x240
 Code: 66 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 17 f5 ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca <48> 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d
 RSP: 0000:ffffc900004578f0 EFLAGS: 00010286
 RAX: 0000000000000000 RBX: ffffea0000152580 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffea0000152580
 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
 R10: ffff88803ecb6c18 R11: 0000000000000000 R12: 0000000000000000
 R13: ffffea0000152580 R14: 0000000000100cc0 R15: ffff888017030288
 FS:  0000000000000000(0000) GS:ffff88803ec80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000008 CR3: 0000000002c2a001 CR4: 0000000000770ee0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 PKRU: 55555554
 Call Trace:
  <TASK>
  ? blkdev_readahead+0x20/0x20
  create_page_buffers+0x79/0x90
  block_read_full_folio+0x58/0x410
  ? blkdev_write_begin+0x20/0x20
  ? xas_store+0x56/0x5b0
  ? xas_load+0x8/0x40
  ? xa_get_order+0x51/0xe0
  ? __mod_memcg_lruvec_state+0x41/0x90
  ? blkdev_readahead+0x20/0x20
  ? blkdev_readahead+0x20/0x20
  filemap_read_folio+0x41/0x2a0
  ? scan_shadow_nodes+0x30/0x30
  ? blkdev_readahead+0x20/0x20
  ? folio_add_lru+0x2d/0x40
  ? blkdev_readahead+0x20/0x20
  do_read_cache_folio+0x103/0x420
  ? __switch_to_asm+0x3a/0x60
  ? __switch_to_asm+0x34/0x60
  ? get_page_from_freelist+0x735/0x1070
  read_part_sector+0x2f/0xa0
  read_lba+0xa2/0x150
  efi_partition+0xdb/0x760
  ? snprintf+0x49/0x60
  ? is_gpt_valid.part.5+0x3f0/0x3f0
  bdev_disk_changed+0x1ce/0x560
  blkdev_get_whole+0x73/0x80
  blkdev_get_by_dev+0x199/0x2e0
  disk_scan_partitions+0x63/0xd0
  device_add_disk+0x3c0/0x3d0
  nvme_scan_ns+0x574/0xcc0
  ? nvme_scan_work+0x23a/0x3f0
  nvme_scan_work+0x23a/0x3f0
  process_one_work+0x1da/0x3a0
  worker_thread+0x205/0x3a0
  ? process_one_work+0x3a0/0x3a0
  kthread+0xc0/0xe0
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x1f/0x30
  </TASK>

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-03 22:07       ` Keith Busch
@ 2023-03-03 22:14         ` Luis Chamberlain
  2023-03-03 22:32           ` Keith Busch
  0 siblings, 1 reply; 67+ messages in thread
From: Luis Chamberlain @ 2023-03-03 22:14 UTC (permalink / raw)
  To: Keith Busch
  Cc: Matthew Wilcox, Theodore Ts'o, Pankaj Raghav, Daniel Gomez,
	Javier González, lsf-pc, linux-fsdevel, linux-mm,
	linux-block

On Fri, Mar 03, 2023 at 03:07:55PM -0700, Keith Busch wrote:
> On Fri, Mar 03, 2023 at 01:45:48PM -0800, Luis Chamberlain wrote:
> > 
> > You'd hope most of it is left to FS + MM, but I'm not yet sure that's
> > quite it yet. Initial experimentation shows just enabling > PAGE_SIZE
> > physical & logical block NVMe devices gets brought down to 512 bytes.
> > That seems odd to say the least. Would changing this be an issue now?
> 
> I think you're talking about removing this part:
> 
> ---
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index c2730b116dc68..2c528f56c2973 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1828,17 +1828,7 @@ static void nvme_update_disk_info(struct gendisk *disk,
>  	unsigned short bs = 1 << ns->lba_shift;
>  	u32 atomic_bs, phys_bs, io_opt = 0;
>  
> -	/*
> -	 * The block layer can't support LBA sizes larger than the page size
> -	 * yet, so catch this early and don't allow block I/O.
> -	 */
> -	if (ns->lba_shift > PAGE_SHIFT) {
> -		capacity = 0;
> -		bs = (1 << 9);
> -	}
> -
>  	blk_integrity_unregister(disk);
> -
>  	atomic_bs = phys_bs = bs;

Yes, clearly it says *yet*, so that begs the question: what would be
required?

Also, going down to 512 seems a bit dramatic, so why not just match
PAGE_SIZE, so 4k? Would such a compromise for now break some stuff?

  Luis

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-03 22:14         ` Luis Chamberlain
@ 2023-03-03 22:32           ` Keith Busch
  2023-03-03 23:09             ` Luis Chamberlain
  2023-03-16 15:29             ` Pankaj Raghav
  0 siblings, 2 replies; 67+ messages in thread
From: Keith Busch @ 2023-03-03 22:32 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Matthew Wilcox, Theodore Ts'o, Pankaj Raghav, Daniel Gomez,
	Javier González, lsf-pc, linux-fsdevel, linux-mm,
	linux-block

On Fri, Mar 03, 2023 at 02:14:55PM -0800, Luis Chamberlain wrote:
> On Fri, Mar 03, 2023 at 03:07:55PM -0700, Keith Busch wrote:
> > On Fri, Mar 03, 2023 at 01:45:48PM -0800, Luis Chamberlain wrote:
> > > 
> > > You'd hope most of it is left to FS + MM, but I'm not yet sure that's
> > > quite it yet. Initial experimentation shows just enabling > PAGE_SIZE
> > > physical & logical block NVMe devices gets brought down to 512 bytes.
> > > That seems odd to say the least. Would changing this be an issue now?
> > 
> > I think you're talking about removing this part:
> > 
> > ---
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index c2730b116dc68..2c528f56c2973 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -1828,17 +1828,7 @@ static void nvme_update_disk_info(struct gendisk *disk,
> >  	unsigned short bs = 1 << ns->lba_shift;
> >  	u32 atomic_bs, phys_bs, io_opt = 0;
> >  
> > -	/*
> > -	 * The block layer can't support LBA sizes larger than the page size
> > -	 * yet, so catch this early and don't allow block I/O.
> > -	 */
> > -	if (ns->lba_shift > PAGE_SHIFT) {
> > -		capacity = 0;
> > -		bs = (1 << 9);
> > -	}
> > -
> >  	blk_integrity_unregister(disk);
> > -
> >  	atomic_bs = phys_bs = bs;
> 
> Yes, clearly it says *yet* so that begs the question what would be
> required?

Oh, gotcha. I'll work on a list of places it currently crashes.
 
> Also, going down to 512 seems a bit dramatic, so why not just match the
> PAGE_SIZE so 4k? Would such a compromise for now break some stuff?

The capacity set to zero ensures it can't be used through the block stack, so
the logical block size limit is unused. 512 is just a default value. We only
bring up the handle so you can administrate it with passthrough commands.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-03 22:32           ` Keith Busch
@ 2023-03-03 23:09             ` Luis Chamberlain
  2023-03-16 15:29             ` Pankaj Raghav
  1 sibling, 0 replies; 67+ messages in thread
From: Luis Chamberlain @ 2023-03-03 23:09 UTC (permalink / raw)
  To: Keith Busch
  Cc: Matthew Wilcox, Theodore Ts'o, Pankaj Raghav, Daniel Gomez,
	Javier González, lsf-pc, linux-fsdevel, linux-mm,
	linux-block

On Fri, Mar 03, 2023 at 03:32:08PM -0700, Keith Busch wrote:
> On Fri, Mar 03, 2023 at 02:14:55PM -0800, Luis Chamberlain wrote:
> > On Fri, Mar 03, 2023 at 03:07:55PM -0700, Keith Busch wrote:
> > > On Fri, Mar 03, 2023 at 01:45:48PM -0800, Luis Chamberlain wrote:
> > > > 
> > > > You'd hope most of it is left to FS + MM, but I'm not yet sure that's
> > > > quite it yet. Initial experimentation shows just enabling > PAGE_SIZE
> > > > physical & logical block NVMe devices gets brought down to 512 bytes.
> > > > That seems odd to say the least. Would changing this be an issue now?
> > > 
> > > I think you're talking about removing this part:
> > > 
> > > ---
> > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > > index c2730b116dc68..2c528f56c2973 100644
> > > --- a/drivers/nvme/host/core.c
> > > +++ b/drivers/nvme/host/core.c
> > > @@ -1828,17 +1828,7 @@ static void nvme_update_disk_info(struct gendisk *disk,
> > >  	unsigned short bs = 1 << ns->lba_shift;
> > >  	u32 atomic_bs, phys_bs, io_opt = 0;
> > >  
> > > -	/*
> > > -	 * The block layer can't support LBA sizes larger than the page size
> > > -	 * yet, so catch this early and don't allow block I/O.
> > > -	 */
> > > -	if (ns->lba_shift > PAGE_SHIFT) {
> > > -		capacity = 0;
> > > -		bs = (1 << 9);
> > > -	}
> > > -
> > >  	blk_integrity_unregister(disk);
> > > -
> > >  	atomic_bs = phys_bs = bs;
> > 
> > Yes, clearly it says *yet* so that begs the question what would be
> > required?
> 
> Oh, gotcha. I'll work on a list of places it currently crashes.

Awesome, that then is part of our dirty laundry TODO for NVMe for larger IO.

> > Also, going down to 512 seems a bit dramatic, so why not just match the
> > PAGE_SIZE so 4k? Would such a compromise for now break some stuff?
> 
> The capacity set to zero ensures it can't be used through the block stack, so
> the logical block size limit is unused.

Oh OK, so in effect we won't have compat issues if we decide later to
change this. So block devices just won't be capable of working? That
saves me tons of tests.

> 512 is just a default value. We only
> bring up the handle so you can administrate it with passthrough commands.

So we'd use 512 for passthrough, but otherwise it won't work?

  Luis

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-03 21:45     ` Luis Chamberlain
  2023-03-03 22:07       ` Keith Busch
@ 2023-03-03 23:51       ` Bart Van Assche
  2023-03-04 11:08       ` Hannes Reinecke
  2 siblings, 0 replies; 67+ messages in thread
From: Bart Van Assche @ 2023-03-03 23:51 UTC (permalink / raw)
  To: Luis Chamberlain, Matthew Wilcox
  Cc: Keith Busch, Theodore Ts'o, Pankaj Raghav, Daniel Gomez,
	Javier González, lsf-pc, linux-fsdevel, linux-mm,
	linux-block

On 3/3/23 13:45, Luis Chamberlain wrote:
> I'm gathering there is generic interest in this topic though.

Some Android storage vendors are interested in larger block sizes, e.g. 
16 KiB. Android currently uses UFS storage and may switch to NVMe in the 
future.

> This also got me thinking about ways to try to replicate
> larger IO virtual devices a bit better too.

Is null_blk good enough to test large block size support in filesystems 
and the block layer?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-03 13:11     ` James Bottomley
@ 2023-03-04  7:34       ` Matthew Wilcox
  2023-03-04 13:41         ` James Bottomley
  2023-03-04 19:04         ` Luis Chamberlain
  0 siblings, 2 replies; 67+ messages in thread
From: Matthew Wilcox @ 2023-03-04  7:34 UTC (permalink / raw)
  To: James Bottomley
  Cc: Keith Busch, Luis Chamberlain, Theodore Ts'o, lsf-pc,
	linux-fsdevel, linux-mm, linux-block

On Fri, Mar 03, 2023 at 08:11:47AM -0500, James Bottomley wrote:
> On Fri, 2023-03-03 at 03:49 +0000, Matthew Wilcox wrote:
> > On Thu, Mar 02, 2023 at 06:58:58PM -0700, Keith Busch wrote:
> > > That said, I was hoping you were going to suggest supporting 16k
> > > logical block sizes. Not a problem on some arch's, but still
> > > problematic when PAGE_SIZE is 4k. :)
> > 
> > I was hoping Luis was going to propose a session on LBA size >
> > PAGE_SIZE. Funnily, while the pressure is coming from the storage
> > vendors, I don't think there's any work to be done in the storage
> > layers.  It's purely a FS+MM problem.
> 
> Heh, I can do the fools rush in bit, especially if what we're
> interested in the minimum it would take to support this ...
> 
> The FS problem could be solved simply by saying FS block size must
> equal device block size, then it becomes purely a MM issue.

Spoken like somebody who's never converted a filesystem to
supporting large folios.  There are a number of issues:

1. The obvious; use of PAGE_SIZE and/or PAGE_SHIFT
2. Use of kmap-family to access, eg directories.  You can't kmap
   an entire folio, only one page at a time.  And if a dentry is split
   across a page boundary ...
3. buffer_heads do not currently support large folios.  Working on it.

Probably a few other things I forget.  But look through the recent
patches to AFS, CIFS, NFS, XFS, iomap that do folio conversions.
A lot of it is pretty mechanical, but some of it takes hard thought.
And if you have ideas about how to handle ext2 directories, I'm all ears.
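
To make (2) concrete, today you end up walking the folio a page at a
time, something like this (illustrative only, not lifted from any
particular filesystem):

/* Process a (possibly large) folio one page at a time, since
 * kmap_local_folio() only maps PAGE_SIZE bytes at the given offset. */
static void walk_folio_one_page_at_a_time(struct folio *folio)
{
	size_t offset;

	for (offset = 0; offset < folio_size(folio); offset += PAGE_SIZE) {
		char *kaddr = kmap_local_folio(folio, offset);

		/* ... only this PAGE_SIZE chunk is mapped; anything that
		 * straddles offset + PAGE_SIZE is the painful case ... */
		kunmap_local(kaddr);
	}
}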

> The MM
> issue could be solved by adding a page order attribute to struct
> address_space and insisting that pagecache/filemap functions in
> mm/filemap.c all have to operate on objects that are an integer
> multiple of the address space order.  The base allocator is
> filemap_alloc_folio, which already has an apparently always zero order
> parameter (hmmm...) and it always seems to be called from sites that
> have the address_space, so it could simply be modified to always
> operate at the address_space order.

Oh, I have a patch for that.  That's the easy part.  The hard part is
plugging your ears to the screams of the MM people who are convinced
that fragmentation will make it impossible to mount your filesystem.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-03 21:45     ` Luis Chamberlain
  2023-03-03 22:07       ` Keith Busch
  2023-03-03 23:51       ` Bart Van Assche
@ 2023-03-04 11:08       ` Hannes Reinecke
  2023-03-04 13:24         ` Javier González
  2023-03-04 16:47         ` Matthew Wilcox
  2 siblings, 2 replies; 67+ messages in thread
From: Hannes Reinecke @ 2023-03-04 11:08 UTC (permalink / raw)
  To: Luis Chamberlain, Matthew Wilcox
  Cc: Keith Busch, Theodore Ts'o, Pankaj Raghav, Daniel Gomez,
	Javier González, lsf-pc, linux-fsdevel, linux-mm,
	linux-block

On 3/3/23 22:45, Luis Chamberlain wrote:
> On Fri, Mar 03, 2023 at 03:49:29AM +0000, Matthew Wilcox wrote:
>> On Thu, Mar 02, 2023 at 06:58:58PM -0700, Keith Busch wrote:
>>> That said, I was hoping you were going to suggest supporting 16k logical block
>>> sizes. Not a problem on some arch's, but still problematic when PAGE_SIZE is
>>> 4k. :)
>>
>> I was hoping Luis was going to propose a session on LBA size > PAGE_SIZE.
>> Funnily, while the pressure is coming from the storage vendors, I don't
>> think there's any work to be done in the storage layers.  It's purely
>> a FS+MM problem.
> 
> You'd hope most of it is left to FS + MM, but I'm not yet sure that's
> quite it yet. Initial experimentation shows just enabling > PAGE_SIZE
> physical & logical block NVMe devices gets brought down to 512 bytes.
> That seems odd to say the least. Would changing this be an issue now?
> 
> I'm gathering there is generic interest in this topic though. So one
> thing we *could* do is perhaps review lay-of-the-land of interest and
> break down what we all think are things likely could be done / needed.
> At the very least we can come out together knowing the unknowns together.
> 
> I started to think about some of these things a while ago and with the
> help of Willy I tried to break down some of the items I gathered from him
> into community OKRs (super informal itemization of goals and sub tasks which
> would complete such goals) and started trying to take a stab at them
> with our team, but obviously I think it would be great if we all just
> divide & conquer here. So maybe reviewing these and extending them
> as a community would be good:
> 
> https://kernelnewbies.org/KernelProjects/large-block-size
> 
> I'm recently interested in tmpfs so will be taking a stab at higher
> order page size support there to see what blows up.
> 
Cool.

> The other stuff like general IOMAP conversion is pretty well known, and
> we already I think have a proposed session on that. But there is also
> even smaller fish to fry, like *just* doing a baseline with some
> filesystems with 4 KiB block size seems in order.
> 
> Hearing filesystem developer's thoughts on support for larger block
> size in light of lower order PAGE_SIZE would be good, given one of the
> odd situations some distributions / teams find themselves in is trying
> to support larger block sizes but with difficult access to higher
> PAGE_SIZE systems. Are there ways to simplify this / help us in general?
> Without it's a bit hard to muck around with some of this in terms of
> support long term. This also got me thinking about ways to try to replicate
> larger IO virtual devices a bit better too. While paying a cloud
> provider to test this is one nice option, it'd be great if I can just do
> this in house with some hacks too. For virtio-blk-pci at least, for instance,
> I wondered whether using just the host page cache suffices, or would a 4K
> page cache on the host modify say a 16 k emulated io controller results
> significantly? How do we most effectively virtualize 16k controllers
> in-house?
> 
> To help with experimenting with large io and NVMe / virtio-blk-pci I
> recently added support to instantiate tons of large IO devices to kdevops
> [0], with it it should be easy to reproduce odd issues we may come up
> with. For instance it should be possible to subsequently extend the
> kdevops fstests or blktests automation support with just a few Kconfig files
> to use some of these largio devices to see what blows up.
> 
We could implement a (virtual) zoned device, and expose each zone as a 
block. That gives us the required large block characteristics, and with
a bit of luck we might be able to dial up to really large block sizes
like the 256M sizes on current SMR drives.
ublk might be a good starting point.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-04 11:08       ` Hannes Reinecke
@ 2023-03-04 13:24         ` Javier González
  2023-03-04 16:47         ` Matthew Wilcox
  1 sibling, 0 replies; 67+ messages in thread
From: Javier González @ 2023-03-04 13:24 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Luis Chamberlain, Matthew Wilcox, Keith Busch, Theodore Ts'o,
	Pankaj Raghav, Daniel Gomez, lsf-pc, linux-fsdevel, linux-mm,
	linux-block

On 04.03.2023 12:08, Hannes Reinecke wrote:
>On 3/3/23 22:45, Luis Chamberlain wrote:
>>On Fri, Mar 03, 2023 at 03:49:29AM +0000, Matthew Wilcox wrote:
>>>On Thu, Mar 02, 2023 at 06:58:58PM -0700, Keith Busch wrote:
>>>>That said, I was hoping you were going to suggest supporting 16k logical block
>>>>sizes. Not a problem on some arch's, but still problematic when PAGE_SIZE is
>>>>4k. :)
>>>
>>>I was hoping Luis was going to propose a session on LBA size > PAGE_SIZE.
>>>Funnily, while the pressure is coming from the storage vendors, I don't
>>>think there's any work to be done in the storage layers.  It's purely
>>>a FS+MM problem.
>>
>>You'd hope most of it is left to FS + MM, but I'm not yet sure that's
>>quite it yet. Initial experimentation shows just enabling > PAGE_SIZE
>>physical & logical block NVMe devices gets brought down to 512 bytes.
>>That seems odd to say the least. Would changing this be an issue now?
>>
>>I'm gathering there is generic interest in this topic though. So one
>>thing we *could* do is perhaps review lay-of-the-land of interest and
>>break down what we all think are things likely could be done / needed.
>>At the very least we can come out together knowing the unknowns together.
>>
>>I started to think about some of these things a while ago and with the
>>help of Willy I tried to break down some of the items I gathered from him
>>into community OKRs (super informal itemization of goals and sub tasks which
>>would complete such goals) and started trying to take a stab at them
>>with our team, but obviously I think it would be great if we all just
>>divide & conquer here. So maybe reviewing these and extending them
>>as a community would be good:
>>
>>https://kernelnewbies.org/KernelProjects/large-block-size
>>
>>I'm recently interested in tmpfs so will be taking a stab at higher
>>order page size support there to see what blows up.
>>
>Cool.
>
>>The other stuff like general IOMAP conversion is pretty well known, and
>>we already I think have a proposed session on that. But there is also
>>even smaller fish to fry, like *just* doing a baseline with some
>>filesystems with 4 KiB block size seems in order.
>>
>>Hearing filesystem developer's thoughts on support for larger block
>>size in light of lower order PAGE_SIZE would be good, given one of the
>>odd situations some distributions / teams find themselves in is trying
>>to support larger block sizes but with difficult access to higher
>>PAGE_SIZE systems. Are there ways to simplify this / help us in general?
>>Without it's a bit hard to muck around with some of this in terms of
>>support long term. This also got me thinking about ways to try to replicate
>>larger IO virtual devices a bit better too. While paying a cloud
>>provider to test this is one nice option, it'd be great if I can just do
>>this in house with some hacks too. For virtio-blk-pci at least, for instance,
>>I wondered whether using just the host page cache suffices, or would a 4K
>>page cache on the host modify say a 16 k emulated io controller results
>>significantly? How do we most effectively virtualize 16k controllers
>>in-house?
>>
>>To help with experimenting with large io and NVMe / virtio-blk-pci I
>>recently added support to instantiate tons of large IO devices to kdevops
>>[0], with it it should be easy to reproduce odd issues we may come up
>>with. For instance it should be possible to subsequently extend the
>>kdevops fstests or blktests automation support with just a few Kconfig files
>>to use some of these largio devices to see what blows up.
>>
>We could implement a (virtual) zoned device, and expose each zone as a
>block. That gives us the required large block characteristics, and
>with
>a bit of luck we might be able to dial up to really large block sizes
>like the 256M sizes on current SMR drives.

Why would we want to deal with the overhead of the zoned block device
for a generic large block implementation?

I can see how this is useful for block devices, but it seems to me that
they would be users of this instead.

The idea would be for NVMe devices to report an LBA format with an LBA
size > 4KB.

Am I missing something?

>ublk might be a good starting point.

Similarly, I would see ublk as a user of this support, where the
underlying device is > 4KB.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-04  7:34       ` Matthew Wilcox
@ 2023-03-04 13:41         ` James Bottomley
  2023-03-04 16:39           ` Matthew Wilcox
  2023-03-04 19:04         ` Luis Chamberlain
  1 sibling, 1 reply; 67+ messages in thread
From: James Bottomley @ 2023-03-04 13:41 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Keith Busch, Luis Chamberlain, Theodore Ts'o, lsf-pc,
	linux-fsdevel, linux-mm, linux-block

On Sat, 2023-03-04 at 07:34 +0000, Matthew Wilcox wrote:
> On Fri, Mar 03, 2023 at 08:11:47AM -0500, James Bottomley wrote:
> > On Fri, 2023-03-03 at 03:49 +0000, Matthew Wilcox wrote:
> > > On Thu, Mar 02, 2023 at 06:58:58PM -0700, Keith Busch wrote:
> > > > That said, I was hoping you were going to suggest supporting
> > > > 16k logical block sizes. Not a problem on some arch's, but
> > > > still problematic when PAGE_SIZE is 4k. :)
> > > 
> > > I was hoping Luis was going to propose a session on LBA size >
> > > PAGE_SIZE. Funnily, while the pressure is coming from the storage
> > > vendors, I don't think there's any work to be done in the storage
> > > layers.  It's purely a FS+MM problem.
> > 
> > Heh, I can do the fools rush in bit, especially if what we're
> > interested in the minimum it would take to support this ...
> > 
> > The FS problem could be solved simply by saying FS block size must
> > equal device block size, then it becomes purely a MM issue.
> 
> Spoken like somebody who's never converted a filesystem to
> supporting large folios.  There are a number of issues:
> 
> 1. The obvious; use of PAGE_SIZE and/or PAGE_SHIFT

Well, yes, a filesystem has to be aware it's using a block size larger
than page size.

> 2. Use of kmap-family to access, eg directories.  You can't kmap
>    an entire folio, only one page at a time.  And if a dentry is
> split across a page boundary ...

Is kmap relevant?  It's only used for reading user pages in the kernel
and I can't see why a filesystem would use it unless it wants to pack
inodes into pages that also contain user data, which is an optimization,
not a fundamental issue (although I grant that as the block size grows
it becomes more useful), so it doesn't have to be part of the minimum
viable prototype.

> 3. buffer_heads do not currently support large folios.  Working on
> it.

Yes, I always forget filesystems still use the buffer cache.  But
fundamentally the buffer_head structure can cope with buffers that span
pages so most of the logic changes would be around grow_dev_page().  It
seems somewhat messy but not too hard.

> Probably a few other things I forget.  But look through the recent
> patches to AFS, CIFS, NFS, XFS, iomap that do folio conversions.
> A lot of it is pretty mechanical, but some of it takes hard thought.
> And if you have ideas about how to handle ext2 directories, I'm all
> ears.

OK, so I can see you were waiting for someone to touch a nerve, but if
I can go back to the stated goal, I never really thought *every*
filesystem would be suitable for block size > page size, so simply
getting a few of the modern ones working would be good enough for the
minimum viable prototype.

> 
> > The MM issue could be solved by adding a page order attribute to
> > struct address_space and insisting that pagecache/filemap functions
> > in mm/filemap.c all have to operate on objects that are an integer
> > multiple of the address space order.  The base allocator is
> > filemap_alloc_folio, which already has an apparently always zero
> > order parameter (hmmm...) and it always seems to be called from
> > sites that
> > have the address_space, so it could simply be modified to always
> > operate at the address_space order.
> 
> Oh, I have a patch for that.  That's the easy part.  The hard part is
> plugging your ears to the screams of the MM people who are convinced
> that fragmentation will make it impossible to mount your filesystem.

Right, so if the MM issue is solved it's picking a first FS for
conversion and solving the buffer problem.

I fully understand that eventually we'll need to get a single large
buffer to span discontiguous pages ... I noted that in the bit you cut,
but I don't see why the prototype shouldn't start with contiguous
pages.

James


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-04 13:41         ` James Bottomley
@ 2023-03-04 16:39           ` Matthew Wilcox
  2023-03-05  4:15             ` Luis Chamberlain
  2023-03-06  3:50             ` James Bottomley
  0 siblings, 2 replies; 67+ messages in thread
From: Matthew Wilcox @ 2023-03-04 16:39 UTC (permalink / raw)
  To: James Bottomley
  Cc: Keith Busch, Luis Chamberlain, Theodore Ts'o, lsf-pc,
	linux-fsdevel, linux-mm, linux-block

On Sat, Mar 04, 2023 at 08:41:04AM -0500, James Bottomley wrote:
> On Sat, 2023-03-04 at 07:34 +0000, Matthew Wilcox wrote:
> > On Fri, Mar 03, 2023 at 08:11:47AM -0500, James Bottomley wrote:
> > > On Fri, 2023-03-03 at 03:49 +0000, Matthew Wilcox wrote:
> > > > On Thu, Mar 02, 2023 at 06:58:58PM -0700, Keith Busch wrote:
> > > > > That said, I was hoping you were going to suggest supporting
> > > > > 16k logical block sizes. Not a problem on some arch's, but
> > > > > still problematic when PAGE_SIZE is 4k. :)
> > > > 
> > > > I was hoping Luis was going to propose a session on LBA size >
> > > > PAGE_SIZE. Funnily, while the pressure is coming from the storage
> > > > vendors, I don't think there's any work to be done in the storage
> > > > layers.  It's purely a FS+MM problem.
> > > 
> > > Heh, I can do the fools rush in bit, especially if what we're
> > > interested in the minimum it would take to support this ...
> > > 
> > > The FS problem could be solved simply by saying FS block size must
> > > equal device block size, then it becomes purely a MM issue.
> > 
> > Spoken like somebody who's never converted a filesystem to
> > supporting large folios.  There are a number of issues:
> > 
> > 1. The obvious; use of PAGE_SIZE and/or PAGE_SHIFT
> 
> Well, yes, a filesystem has to be aware it's using a block size larger
> than page size.
> 
> > 2. Use of kmap-family to access, eg directories.  You can't kmap
> >    an entire folio, only one page at a time.  And if a dentry is
> > split across a page boundary ...
> 
> Is kmap relevant?  It's only used for reading user pages in the kernel
> and I can't see why a filesystem would use it unless it wants to pack
> inodes into pages that also contain user data, which is an optimization
> not a fundamental issue (although I grant that as the blocksize grows
> it becomes more useful) so it doesn't have to be part of the minimum
> viable prototype.

Filesystems often choose to store their metadata in HIGHMEM.  This wasn't
an entirely crazy idea back in, say, 2005, when you might be running
an ext2 filesystem on a machine with 32GB of RAM, and only 800MB of
address space for it.

Now it's silly.  Buy a real computer.  I'm getting more and more
comfortable with the idea that "Linux doesn't support block sizes >
PAGE_SIZE on 32-bit machines" is an acceptable answer.

> > 3. buffer_heads do not currently support large folios.  Working on
> > it.
> 
> Yes, I always forget filesystems still use the buffer cache.  But
> fundamentally the buffer_head structure can cope with buffers that span
> pages so most of the logic changes would be around grow_dev_page().  It
> seems somewhat messy but not too hard.

I forgot one particularly nasty case; we have filesystems (including the
mpage code used by a number of filesystems) which put an array of block
numbers on the stack.  Not a big deal when that's 8 entries (4kB/512 * 8
bytes = 64 bytes), but it starts to get noticable at 64kB PAGE_SIZE (1kB
is a little large for a stack allocation) and downright unreasonable
if you try to do something to a 2MB allocation (32kB).
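
For reference, the offending pattern is roughly this (see fs/mpage.c):

	sector_t blocks[MAX_BUF_PER_PAGE];	/* PAGE_SIZE / 512 entries */

	/*
	 * 4kB page:  8 entries    * 8 bytes =  64 bytes of stack
	 * 64kB page: 128 entries  * 8 bytes =   1kB
	 * 2MB folio: 4096 entries * 8 bytes =  32kB -- not going to happen
	 */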

> > Probably a few other things I forget.  But look through the recent
> > patches to AFS, CIFS, NFS, XFS, iomap that do folio conversions.
> > A lot of it is pretty mechanical, but some of it takes hard thought.
> > And if you have ideas about how to handle ext2 directories, I'm all
> > ears.
> 
> OK, so I can see you were waiting for someone to touch a nerve, but if
> I can go back to the stated goal, I never really thought *every*
> filesystem would be suitable for block size > page size, so simply
> getting a few of the modern ones working would be good enough for the
> minimum viable prototype.

XFS already works with arbitrary-order folios.  The only needed piece is
specifying to the VFS that there's a minimum order for this particular
inode, and having the VFS honour that everywhere.
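
i.e. something of this shape (helper name invented; the existing
mapping_set_large_folios() only says "large folios are allowed here",
it doesn't impose a floor):

/* Would need to be honoured by readahead, write_begin, truncate,
 * invalidate, ... everywhere the page cache picks an order. */
static inline void mapping_set_min_folio_order(struct address_space *mapping,
					       unsigned int order)
{
	/* e.g. stash it next to the large-folio flag in mapping->flags */
}

static void example_fs_setup_inode(struct inode *inode)
{
	/* 16k fs blocks on a 4k PAGE_SIZE machine -> order 2 */
	mapping_set_min_folio_order(inode->i_mapping,
				    inode->i_blkbits - PAGE_SHIFT);
}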

What "touches a nerve" is people who clearly haven't been paying attention
to the problem making sweeping assertions about what the easy and hard
parts are.

> I fully understand that eventually we'll need to get a single large
> buffer to span discontiguous pages ... I noted that in the bit you cut,
> but I don't see why the prototype shouldn't start with contiguous
> pages.

I disagree that this is a desirable goal.  To solve the scalability
issues we have in the VFS, we need to manage memory in larger chunks
than PAGE_SIZE.  That makes the concerns expressed in previous years moot.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-04 11:08       ` Hannes Reinecke
  2023-03-04 13:24         ` Javier González
@ 2023-03-04 16:47         ` Matthew Wilcox
  2023-03-04 17:17           ` Hannes Reinecke
  1 sibling, 1 reply; 67+ messages in thread
From: Matthew Wilcox @ 2023-03-04 16:47 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Luis Chamberlain, Keith Busch, Theodore Ts'o, Pankaj Raghav,
	Daniel Gomez, Javier González, lsf-pc, linux-fsdevel,
	linux-mm, linux-block

On Sat, Mar 04, 2023 at 12:08:36PM +0100, Hannes Reinecke wrote:
> We could implement a (virtual) zoned device, and expose each zone as a
> block. That gives us the required large block characteristics, and with
> a bit of luck we might be able to dial up to really large block sizes
> like the 256M sizes on current SMR drives.
> ublk might be a good starting point.

Ummmm.  Is supporting 256MB block sizes really a desired goal?  I suggest
that is far past the knee of the curve; if we can only write 256MB chunks
as a single entity, we're looking more at a filesystem redesign than we
are at making filesystems and the MM support 256MB size blocks.

The current work is all going towards tracking memory in larger chunks,
so writing back, eg, 64kB chunks of the file.  But if 256MB is where
we're going, we need to be thinking more like a RAID device and
accumulating writes into a log that we can then blast out in a single
giant write.

fsync() and O_SYNC is going to be painful for that kind of device.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-04 16:47         ` Matthew Wilcox
@ 2023-03-04 17:17           ` Hannes Reinecke
  2023-03-04 17:54             ` Matthew Wilcox
  0 siblings, 1 reply; 67+ messages in thread
From: Hannes Reinecke @ 2023-03-04 17:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Keith Busch, Theodore Ts'o, Pankaj Raghav,
	Daniel Gomez, Javier González, lsf-pc, linux-fsdevel,
	linux-mm, linux-block

On 3/4/23 17:47, Matthew Wilcox wrote:
> On Sat, Mar 04, 2023 at 12:08:36PM +0100, Hannes Reinecke wrote:
>> We could implement a (virtual) zoned device, and expose each zone as a
>> block. That gives us the required large block characteristics, and with
>> a bit of luck we might be able to dial up to really large block sizes
>> like the 256M sizes on current SMR drives.
>> ublk might be a good starting point.
> 
> Ummmm.  Is supporting 256MB block sizes really a desired goal?  I suggest
> that is far past the knee of the curve; if we can only write 256MB chunks
> as a single entity, we're looking more at a filesystem redesign than we
> are at making filesystems and the MM support 256MB size blocks.
> 
Naa, not really. It _would_ be cool as we could get rid of all the
kludges we have nowadays re sequential writes.
And, remember, 256M is just a number someone thought to be a good 
compromise. If we end up with a lower number (16M?) we might be able
to convince the powers that be to change their zone size.
Heck, with 16M block size there wouldn't be a _need_ for zones in
the first place.

But yeah, 256M is excessive. Initially I would shoot for something
like 2M.

> The current work is all going towards tracking memory in larger chunks,
> so writing back, eg, 64kB chunks of the file.  But if 256MB is where
> we're going, we need to be thinking more like a RAID device and
> accumulating writes into a log that we can then blast out in a single
> giant write.
> 
Yeah. I _do_ remember someone hch-ish presenting something two years
back at ALPSS, but I guess that's still on the back-burner.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-04 17:17           ` Hannes Reinecke
@ 2023-03-04 17:54             ` Matthew Wilcox
  2023-03-04 18:53               ` Luis Chamberlain
                                 ` (2 more replies)
  0 siblings, 3 replies; 67+ messages in thread
From: Matthew Wilcox @ 2023-03-04 17:54 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Luis Chamberlain, Keith Busch, Theodore Ts'o, Pankaj Raghav,
	Daniel Gomez, Javier González, lsf-pc, linux-fsdevel,
	linux-mm, linux-block

On Sat, Mar 04, 2023 at 06:17:35PM +0100, Hannes Reinecke wrote:
> On 3/4/23 17:47, Matthew Wilcox wrote:
> > On Sat, Mar 04, 2023 at 12:08:36PM +0100, Hannes Reinecke wrote:
> > > We could implement a (virtual) zoned device, and expose each zone as a
> > > block. That gives us the required large block characteristics, and with
> > > a bit of luck we might be able to dial up to really large block sizes
> > > like the 256M sizes on current SMR drives.
> > > ublk might be a good starting point.
> > 
> > Ummmm.  Is supporting 256MB block sizes really a desired goal?  I suggest
> > that is far past the knee of the curve; if we can only write 256MB chunks
> > as a single entity, we're looking more at a filesystem redesign than we
> > are at making filesystems and the MM support 256MB size blocks.
> > 
> Naa, not really. It _would_ be cool as we could get rid of all the kludges
> we have nowadays re sequential writes.
> And, remember, 256M is just a number someone thought to be a good
> compromise. If we end up with a lower number (16M?) we might be able
> to convince the powers that be to change their zone size.
> Heck, with 16M block size there wouldn't be a _need_ for zones in
> the first place.
> 
> But yeah, 256M is excessive. Initially I would shoot for something
> like 2M.

I think we're talking about different things (probably different storage
vendors want different things, or even different people at the same
storage vendor want different things).

Luis and I are talking about larger LBA sizes.  That is, the minimum
read/write size from the block device is 16kB or 64kB or whatever.
In this scenario, the minimum amount of space occupied by a file goes
up from 512 bytes or 4kB to 64kB.  That's doable, even if somewhat
suboptimal.

Your concern seems to be more around shingled devices (or their equivalent
in SSD terms) where there are large zones which are append-only, but
you can still random-read 512 byte LBAs.  I think there are different
solutions to these problems, and people are working on both of these
problems.

But if storage vendors are really pushing for 256MB LBAs, then that's
going to need a third kind of solution, and I'm not aware of anyone
working on that.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-04 17:54             ` Matthew Wilcox
@ 2023-03-04 18:53               ` Luis Chamberlain
  2023-03-05  3:06               ` Damien Le Moal
  2023-03-05 11:22               ` Hannes Reinecke
  2 siblings, 0 replies; 67+ messages in thread
From: Luis Chamberlain @ 2023-03-04 18:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hannes Reinecke, Keith Busch, Theodore Ts'o, Pankaj Raghav,
	Daniel Gomez, Javier González, Klaus Jensen, lsf-pc,
	linux-fsdevel, linux-mm, linux-block

On Sat, Mar 04, 2023 at 05:54:38PM +0000, Matthew Wilcox wrote:
> On Sat, Mar 04, 2023 at 06:17:35PM +0100, Hannes Reinecke wrote:
> > On 3/4/23 17:47, Matthew Wilcox wrote:
> > > On Sat, Mar 04, 2023 at 12:08:36PM +0100, Hannes Reinecke wrote:
> > > > We could implement a (virtual) zoned device, and expose each zone as a
> > > > block. That gives us the required large block characteristics, and with
> > > > a bit of luck we might be able to dial up to really large block sizes
> > > > like the 256M sizes on current SMR drives.
> > > > ublk might be a good starting point.
> > > 
> > > Ummmm.  Is supporting 256MB block sizes really a desired goal?  I suggest
> > > that is far past the knee of the curve; if we can only write 256MB chunks
> > > as a single entity, we're looking more at a filesystem redesign than we
> > > are at making filesystems and the MM support 256MB size blocks.
> > > 
> > Naa, not really. It _would_ be cool as we could get rid of all the kludges
> > we have nowadays re sequential writes.
> > And, remember, 256M is just a number someone thought to be a good
> > compromise. If we end up with a lower number (16M?) we might be able
> > to convince the powers that be to change their zone size.
> > Heck, with 16M block size there wouldn't be a _need_ for zones in
> > the first place.
> > 
> > But yeah, 256M is excessive. Initially I would shoot for something
> > like 2M.
> 
> I think we're talking about different things (probably different storage
> vendors want different things, or even different people at the same
> storage vendor want different things).
> 
> Luis and I are talking about larger LBA sizes.  That is, the minimum
> read/write size from the block device is 16kB or 64kB or whatever.
> In this scenario, the minimum amount of space occupied by a file goes
> up from 512 bytes or 4kB to 64kB.  That's doable, even if somewhat
> suboptimal.

Yes.

> Your concern seems to be more around shingled devices (or their equivalent
> in SSD terms) where there are large zones which are append-only, but
> you can still random-read 512 byte LBAs.  I think there are different
> solutions to these problems, and people are working on both of these
> problems.
> 
> But if storage vendors are really pushing for 256MB LBAs, then that's
> going to need a third kind of solution, and I'm not aware of anyone
> working on that.

Hannes had replied to my suggestion about a way to *virtualize* a real
storage controller with a larger LBA *optimally*; in that thread I was
hinting at avoiding cache=passthrough on the hypervisor and instead using
something like cache=writeback or even cache=unsafe for experimentation
with virtio-blk-pci. For a more elaborate description of these see [0],
but the skinny is that cache=writeback uses the host storage controller
while the others rely on the host page cache.

The overhead of latencies incurred by anything replicating larger LBAs
should be mitigated, so I don't think using a zoned storage device for it
would be good.

I was asking whether or not experimenting with a different host page cache
PAGE_SIZE might help replicate things a bit more realistically, even if
it was suboptimal for the host for the reasons previously noted as stupid.

If sticking to the host PAGE_SIZE, another idea may be to use tmpfs +
huge pages so as to at least mitigate TLB lookups.

[0] https://github.com/linux-kdevops/kdevops/commit/94844c4684a51997cb327d2fb0ce491fe4429dfc

  Luis

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-04  7:34       ` Matthew Wilcox
  2023-03-04 13:41         ` James Bottomley
@ 2023-03-04 19:04         ` Luis Chamberlain
  1 sibling, 0 replies; 67+ messages in thread
From: Luis Chamberlain @ 2023-03-04 19:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: James Bottomley, Keith Busch, Theodore Ts'o,
	Javier González, Pankaj Raghav, Daniel Gomez, lsf-pc,
	linux-fsdevel, linux-mm, linux-block

On Sat, Mar 04, 2023 at 07:34:33AM +0000, Matthew Wilcox wrote:
> The hard part is plugging your ears to the screams of the MM people
> who are convinced that fragmentation will make it impossible to mount
> your filesystem.

One doesn't just need to plug one's ears, one can also be prepared for that,
should it actually end up being true, because frankly we don't have
the evidence yet. And it's something I have slowly started to think about --
because -- why not be ready?

In fact, let's say the inverse is true: having the tooling to prove them
wrong is also a desirable outcome, and that begs the question of proper
tooling to measure this, etc. Something probably more for an MM track.
What would satisfy proof, and what tooling / metrics would be used?

It is *not* something that is only implicated by storage IO controllers,
and so what we're looking at is a generic device issue / concern for memory
fragmentation.

*If* the generalization of huge page use for something like bpf-prog-pack ends
up materializing and we end up using it for even *all* module .text,
*then* I *think* something similar could be a way to address that concern
for devices with huge pages for CMA. This is one area where I think
device hints for large IO might come in handy: we can limit such
dedicated pools to only devices with hints and limit the amount of huge
pages used for this purpose.

But ask me again 2 kernel releases from now.

  Luis

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-04 17:54             ` Matthew Wilcox
  2023-03-04 18:53               ` Luis Chamberlain
@ 2023-03-05  3:06               ` Damien Le Moal
  2023-03-05 11:22               ` Hannes Reinecke
  2 siblings, 0 replies; 67+ messages in thread
From: Damien Le Moal @ 2023-03-05  3:06 UTC (permalink / raw)
  To: Matthew Wilcox, Hannes Reinecke
  Cc: Luis Chamberlain, Keith Busch, Theodore Ts'o, Pankaj Raghav,
	Daniel Gomez, Javier González, lsf-pc, linux-fsdevel,
	linux-mm, linux-block

On 3/5/23 02:54, Matthew Wilcox wrote:
> On Sat, Mar 04, 2023 at 06:17:35PM +0100, Hannes Reinecke wrote:
>> On 3/4/23 17:47, Matthew Wilcox wrote:
>>> On Sat, Mar 04, 2023 at 12:08:36PM +0100, Hannes Reinecke wrote:
>>>> We could implement a (virtual) zoned device, and expose each zone as a
>>>> block. That gives us the required large block characteristics, and with
>>>> a bit of luck we might be able to dial up to really large block sizes
>>>> like the 256M sizes on current SMR drives.
>>>> ublk might be a good starting point.
>>>
>>> Ummmm.  Is supporting 256MB block sizes really a desired goal?  I suggest
>>> that is far past the knee of the curve; if we can only write 256MB chunks
>>> as a single entity, we're looking more at a filesystem redesign than we
>>> are at making filesystems and the MM support 256MB size blocks.
>>>
>> Naa, not really. It _would_ be cool as we could get rid of all the kludges
>> we have nowadays re sequential writes.
>> And, remember, 256M is just a number someone thought to be a good
>> compromise. If we end up with a lower number (16M?) we might be able
>> to convince the powers that be to change their zone size.
>> Heck, with 16M block size there wouldn't be a _need_ for zones in
>> the first place.
>>
>> But yeah, 256M is excessive. Initially I would shoot for something
>> like 2M.
> 
> I think we're talking about different things (probably different storage
> vendors want different things, or even different people at the same
> storage vendor want different things).
> 
> Luis and I are talking about larger LBA sizes.  That is, the minimum
> read/write size from the block device is 16kB or 64kB or whatever.
> In this scenario, the minimum amount of space occupied by a file goes
> up from 512 bytes or 4kB to 64kB.  That's doable, even if somewhat
> suboptimal.

FYI, that is already out there, even though hidden from the host for backward
compatibility reasons. Example: WD SMR drives use 64K distributed sectors, which
are essentially 16 4KB sectors striped together to achieve stronger ECC.

C.f. Distributed sector format (DSEC):
https://documents.westerndigital.com/content/dam/doc-library/en_us/assets/public/western-digital/collateral/tech-brief/tech-brief-ultrasmr-technology.pdf

This is hidden from the host though, and the LBA remains 512B or 4KB. This however
does result in a measurable impact on IOPS with small reads, as a sub-64K read
needs to be internally processed as a 64KB read to get the entire DSEC. The drop
in performance is not dramatic: about 5% lower IOPS compared to an equivalent
drive without DSEC. Still, that matters considering HDD IO density issues
(IOPS/TB), but in the case of SMR, that is part of the increased capacity trade-off.

So exposing the DSEC directly as the LBA size is not a stretch for the HDD FW,
as long as the host supports that. There are no plans to do so though, but we
could try experimenting.

For host side experimentation, something like qemu/nvme device emulation or
tcmu-runner for scsi devices should allow emulating large block sizes
fairly easily.

> 
> Your concern seems to be more around shingled devices (or their equivalent
> in SSD terms) where there are large zones which are append-only, but
> you can still random-read 512 byte LBAs.  I think there are different
> solutions to these problems, and people are working on both of these
> problems.

The above example does show that a device can generally implement emulation of
a smaller LBA even with an internally larger read/write size unit. Having that
larger size unit advertised as the optimal IO size alignment (as it should be) and
being more diligent in having FSes & mm use that may be a good approach too.

> 
> But if storage vendors are really pushing for 256MB LBAs, then that's
> going to need a third kind of solution, and I'm not aware of anyone
> working on that.

No, we are not pushing for such crazy numbers :)
And for the SMR case, smaller zone sizes are not desired, as a small zone size leads to
more real estate waste on the HDD platters, so lower total capacity (not desired
given that SMR is all about getting higher capacity "for free").


-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-04 16:39           ` Matthew Wilcox
@ 2023-03-05  4:15             ` Luis Chamberlain
  2023-03-05  5:02               ` Matthew Wilcox
  2023-03-06 12:04               ` Hannes Reinecke
  2023-03-06  3:50             ` James Bottomley
  1 sibling, 2 replies; 67+ messages in thread
From: Luis Chamberlain @ 2023-03-05  4:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: James Bottomley, Keith Busch, Theodore Ts'o, lsf-pc,
	linux-fsdevel, linux-mm, linux-block

On Sat, Mar 04, 2023 at 04:39:02PM +0000, Matthew Wilcox wrote:
> I'm getting more and more
> comfortable with the idea that "Linux doesn't support block sizes >
> PAGE_SIZE on 32-bit machines" is an acceptable answer.

First of all, filesystems would need to add support for block
sizes > PAGE_SIZE, and that takes effort. It is also a support question.

I think garnering consensus from filesystem developers that we don't want
to support block sizes > PAGE_SIZE on 32-bit systems would be a good
thing to review at LSFMM or even on this list. I highly doubt anyone
is interested in that support.

> XFS already works with arbitrary-order folios. 

But block sizes > PAGE_SIZE is work which is still not merged. It
*can* be with time. That would allow one to muck with larger block
sizes than 4k on x86-64 for instance. Without this, you can't play
ball.

> The only needed piece is
> specifying to the VFS that there's a minimum order for this particular
> inode, and having the VFS honour that everywhere.

Other than the above too, don't we still also need to figure out what
fs APIs would incur larger order folios? And then what about corner cases
with the page cache?

I was hoping some of these nooks and crannies could be explored with tmpfs.

  Luis

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-05  4:15             ` Luis Chamberlain
@ 2023-03-05  5:02               ` Matthew Wilcox
  2023-03-08  6:11                 ` Luis Chamberlain
  2023-03-06 12:04               ` Hannes Reinecke
  1 sibling, 1 reply; 67+ messages in thread
From: Matthew Wilcox @ 2023-03-05  5:02 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: James Bottomley, Keith Busch, Theodore Ts'o, lsf-pc,
	linux-fsdevel, linux-mm, linux-block

On Sat, Mar 04, 2023 at 08:15:50PM -0800, Luis Chamberlain wrote:
> On Sat, Mar 04, 2023 at 04:39:02PM +0000, Matthew Wilcox wrote:
> > I'm getting more and more
> > comfortable with the idea that "Linux doesn't support block sizes >
> > PAGE_SIZE on 32-bit machines" is an acceptable answer.
> 
> First of all, filesystems would need to add support for block
> sizes > PAGE_SIZE, and that takes effort. It is also a support question.
> 
> I think garnering consensus from filesystem developers that we don't want
> to support block sizes > PAGE_SIZE on 32-bit systems would be a good
> thing to review at LSFMM or even on this list. I highly doubt anyone
> is interested in that support.

Agreed.

> > XFS already works with arbitrary-order folios. 
> 
> But block sizes > PAGE_SIZE is work which is still not merged. It
> *can* be with time. That would allow one to muck with larger block
> sizes than 4k on x86-64 for instance. Without this, you can't play
> ball.

Do you mean that XFS is checking that fs block size <= PAGE_SIZE and
that check needs to be dropped?  If so, I don't see where that happens.

Or do you mean that the blockdev "filesystem" needs to be enhanced to
support large folios?  That's going to be kind of a pain because it
uses buffer_heads.  And ext4 depends on it using buffer_heads.  So,
yup, more work needed than I remembered (but as I said, it's FS side,
not block layer or driver work).

Or were you referring to the NVMe PAGE_SIZE sanity check that Keith
mentioned upthread?

> > The only needed piece is
> > specifying to the VFS that there's a minimum order for this particular
> > inode, and having the VFS honour that everywhere.
> 
> Other than the above too, don't we still also need to figure out what
> fs APIs would incur larger order folios? And then what about corner cases
> with the page cache?
> 
> I was hoping some of these nooks and crannies could be explored with tmpfs.

I think we're exploring all those with XFS.  Or at least, many of
them.  A lot of the folio conversion patches you see flowing past
are pure efficiency gains -- no need to convert between pages and
folios implicitly; do the explicit conversions and save instructions.
Most of the correctness issues were found & fixed a long time ago when
PMD support was added to tmpfs.  One notable exception would be the
writeback path since tmpfs doesn't writeback, it has that special thing
it does with swap.

tmpfs is a rather special case as far as its use of the filesystem APIs
go, but I suspect I've done most of the needed work to have it work with
arbitrary order folios instead of just PTE and PMD sizes.  There's
probably some left-over assumptions that I didn't find yet.  Maybe in
the swap path, for example ;-)

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-04 17:54             ` Matthew Wilcox
  2023-03-04 18:53               ` Luis Chamberlain
  2023-03-05  3:06               ` Damien Le Moal
@ 2023-03-05 11:22               ` Hannes Reinecke
  2023-03-06  8:23                 ` Matthew Wilcox
                                   ` (2 more replies)
  2 siblings, 3 replies; 67+ messages in thread
From: Hannes Reinecke @ 2023-03-05 11:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Keith Busch, Theodore Ts'o, Pankaj Raghav,
	Daniel Gomez, Javier González, lsf-pc, linux-fsdevel,
	linux-mm, linux-block

On 3/4/23 18:54, Matthew Wilcox wrote:
> On Sat, Mar 04, 2023 at 06:17:35PM +0100, Hannes Reinecke wrote:
>> On 3/4/23 17:47, Matthew Wilcox wrote:
>>> On Sat, Mar 04, 2023 at 12:08:36PM +0100, Hannes Reinecke wrote:
>>>> We could implement a (virtual) zoned device, and expose each zone as a
>>>> block. That gives us the required large block characteristics, and with
>>>> a bit of luck we might be able to dial up to really large block sizes
>>>> like the 256M sizes on current SMR drives.
>>>> ublk might be a good starting point.
>>>
>>> Ummmm.  Is supporting 256MB block sizes really a desired goal?  I suggest
>>> that is far past the knee of the curve; if we can only write 256MB chunks
>>> as a single entity, we're looking more at a filesystem redesign than we
>>> are at making filesystems and the MM support 256MB size blocks.
>>>
>> Naa, not really. It _would_ be cool as we could get rid of all the kludges
>> we have nowadays regarding sequential writes.
>> And, remember, 256M is just a number someone thought to be a good
>> compromise. If we end up with a lower number (16M?) we might be able
>> to convince the powers that be to change their zone size.
>> Heck, with 16M block size there wouldn't be a _need_ for zones in
>> the first place.
>>
>> But yeah, 256M is excessive. Initially I would shoot for something
>> like 2M.
> 
> I think we're talking about different things (probably different storage
> vendors want different things, or even different people at the same
> storage vendor want different things).
> 
> Luis and I are talking about larger LBA sizes.  That is, the minimum
> read/write size from the block device is 16kB or 64kB or whatever.
> In this scenario, the minimum amount of space occupied by a file goes
> up from 512 bytes or 4kB to 64kB.  That's doable, even if somewhat
> suboptimal.
> 
And so do I. One can view zones as really large LBAs.

Indeed it might be suboptimal from the OS point of view.
But from the device point of view it won't.
And, in fact, with devices becoming faster and faster the question is
whether sticking with relatively small sectors won't become a limiting 
factor eventually.

> Your concern seems to be more around shingled devices (or their equivalent
> in SSD terms) where there are large zones which are append-only, but
> you can still random-read 512 byte LBAs.  I think there are different
> solutions to these problems, and people are working on both of these
> problems.
> 
My point being that zones are just there because the I/O stack can only 
deal with sectors up to 4k. If the I/O stack would be capable of dealing
with larger LBAs one could identify a zone with an LBA, and the entire 
issue of append-only and sequential writes would be moot.
Even the entire concept of zones becomes irrelevant as the OS would 
trivially only write entire zones.

> But if storage vendors are really pushing for 256MB LBAs, then that's
> going to need a third kind of solution, and I'm not aware of anyone
> working on that.

What I was saying is that 256M is not set in stone. It's just a 
compromise vendors used. Even if in the course of development we arrive
at a lower number of max LBA we can handle (say, 2MB) I am pretty
sure vendors will be quite interested in that.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-04 16:39           ` Matthew Wilcox
  2023-03-05  4:15             ` Luis Chamberlain
@ 2023-03-06  3:50             ` James Bottomley
  1 sibling, 0 replies; 67+ messages in thread
From: James Bottomley @ 2023-03-06  3:50 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Keith Busch, Luis Chamberlain, Theodore Ts'o, lsf-pc,
	linux-fsdevel, linux-mm, linux-block

On Sat, 2023-03-04 at 16:39 +0000, Matthew Wilcox wrote:
> > I fully understand that eventually we'll need to get a single large
> > buffer to span discontiguous pages ... I noted that in the bit you
> > cut, but I don't see why the prototype shouldn't start with
> > contiguous pages.
> 
> I disagree that this is a desirable goal.  To solve the scalability
> issues we have in the VFS, we need to manage memory in larger chunks
> than PAGE_SIZE.  That makes the concerns expressed in previous years
> moot.

Well, what is or isn't desirable in this regard can be left to a later
exploration.  Most of the cloud storage problems seem to be solved with
a 16k block size, for which I think we'll find current compaction is
good enough.  I actually think we might not have a current cloud use
case beyond 64k sectors.

James


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-05 11:22               ` Hannes Reinecke
@ 2023-03-06  8:23                 ` Matthew Wilcox
  2023-03-06 10:05                   ` Hannes Reinecke
  2023-03-06 16:12                   ` Theodore Ts'o
  2023-03-08 19:35                 ` Luis Chamberlain
  2023-03-08 19:55                 ` Bart Van Assche
  2 siblings, 2 replies; 67+ messages in thread
From: Matthew Wilcox @ 2023-03-06  8:23 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Luis Chamberlain, Keith Busch, Theodore Ts'o, Pankaj Raghav,
	Daniel Gomez, Javier González, lsf-pc, linux-fsdevel,
	linux-mm, linux-block

On Sun, Mar 05, 2023 at 12:22:15PM +0100, Hannes Reinecke wrote:
> On 3/4/23 18:54, Matthew Wilcox wrote:
> > I think we're talking about different things (probably different storage
> > vendors want different things, or even different people at the same
> > storage vendor want different things).
> > 
> > Luis and I are talking about larger LBA sizes.  That is, the minimum
> > read/write size from the block device is 16kB or 64kB or whatever.
> > In this scenario, the minimum amount of space occupied by a file goes
> > up from 512 bytes or 4kB to 64kB.  That's doable, even if somewhat
> > suboptimal.
> > 
> And so do I. One can view zones as really large LBAs.
> 
> Indeed it might be suboptimal from the OS point of view.
> But from the device point of view it won't.
> And, in fact, with devices becoming faster and faster the question is
> whether sticking with relatively small sectors won't become a limiting
> factor eventually.
> 
> > Your concern seems to be more around shingled devices (or their equivalent
> > in SSD terms) where there are large zones which are append-only, but
> > you can still random-read 512 byte LBAs.  I think there are different
> > solutions to these problems, and people are working on both of these
> > problems.
> > 
> My point being that zones are just there because the I/O stack can only deal
> with sectors up to 4k. If the I/O stack would be capable of dealing
> with larger LBAs one could identify a zone with an LBA, and the entire issue
> of append-only and sequential writes would be moot.
> Even the entire concept of zones becomes irrelevant as the OS would
> trivially only write entire zones.

All current filesystems that I'm aware of require their fs block size
to be >= LBA size.  That is, you can't take a 512-byte blocksize ext2
filesystem and put it on a 4kB LBA storage device.
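
A quick illustration of that constraint (device name illustrative; mke2fs is
expected to refuse a block size below the device's logical sector size):

$ cat /sys/block/nvme0n1/queue/logical_block_size
4096
$ mkfs.ext2 -b 512 /dev/nvme0n1    # refused: fs block size < logical sector size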

That means that files can only grow/shrink in 256MB increments.  I
don't think that amount of wasted space is going to be acceptable.
So if we're serious about going down this path, we need to tell
filesystem people to start working out how to support fs block
size < LBA size.

That's a big ask, so let's be sure storage vendors actually want
this.  Both supporting zoned devices & supporting 16k/64k block
sizes are easier asks.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-06  8:23                 ` Matthew Wilcox
@ 2023-03-06 10:05                   ` Hannes Reinecke
  2023-03-06 16:12                   ` Theodore Ts'o
  1 sibling, 0 replies; 67+ messages in thread
From: Hannes Reinecke @ 2023-03-06 10:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Keith Busch, Theodore Ts'o, Pankaj Raghav,
	Daniel Gomez, Javier González, lsf-pc, linux-fsdevel,
	linux-mm, linux-block

On 3/6/23 09:23, Matthew Wilcox wrote:
> On Sun, Mar 05, 2023 at 12:22:15PM +0100, Hannes Reinecke wrote:
>> On 3/4/23 18:54, Matthew Wilcox wrote:
>>> I think we're talking about different things (probably different storage
>>> vendors want different things, or even different people at the same
>>> storage vendor want different things).
>>>
>>> Luis and I are talking about larger LBA sizes.  That is, the minimum
>>> read/write size from the block device is 16kB or 64kB or whatever.
>>> In this scenario, the minimum amount of space occupied by a file goes
>>> up from 512 bytes or 4kB to 64kB.  That's doable, even if somewhat
>>> suboptimal.
>>>
>> And so do I. One can view zones as really large LBAs.
>>
>> Indeed it might be suboptimal from the OS point of view.
>> But from the device point of view it won't.
>> And, in fact, with devices becoming faster and faster the question is
>> whether sticking with relatively small sectors won't become a limiting
>> factor eventually.
>>
>>> Your concern seems to be more around shingled devices (or their equivalent
>>> in SSD terms) where there are large zones which are append-only, but
>>> you can still random-read 512 byte LBAs.  I think there are different
>>> solutions to these problems, and people are working on both of these
>>> problems.
>>>
>> My point being that zones are just there because the I/O stack can only deal
>> with sectors up to 4k. If the I/O stack would be capable of dealing
>> with larger LBAs one could identify a zone with an LBA, and the entire issue
>> of append-only and sequential writes would be moot.
>> Even the entire concept of zones becomes irrelevant as the OS would
>> trivially only write entire zones.
> 
> All current filesystems that I'm aware of require their fs block size
> to be >= LBA size.  That is, you can't take a 512-byte blocksize ext2
> filesystem and put it on a 4kB LBA storage device.
> 
> That means that files can only grow/shrink in 256MB increments.  I
> don't think that amount of wasted space is going to be acceptable.
> So if we're serious about going down this path, we need to tell
> filesystem people to start working out how to support fs block
> size < LBA size.
> 
> That's a big ask, so let's be sure storage vendors actually want
> this.  Both supporting zoned devices & supporting 16k/64k block
> sizes are easier asks.

Why, I know. And this really is a future goal.
(Possibly a very _distant_ future goal.)

Indeed we should concentrate on getting 16k/64k blocks initially.
Or maybe 128k blocks to help our RAIDed friends.

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-05  4:15             ` Luis Chamberlain
  2023-03-05  5:02               ` Matthew Wilcox
@ 2023-03-06 12:04               ` Hannes Reinecke
  1 sibling, 0 replies; 67+ messages in thread
From: Hannes Reinecke @ 2023-03-06 12:04 UTC (permalink / raw)
  To: Luis Chamberlain, Matthew Wilcox
  Cc: James Bottomley, Keith Busch, Theodore Ts'o, lsf-pc,
	linux-fsdevel, linux-mm, linux-block

On 3/5/23 05:15, Luis Chamberlain wrote:
> On Sat, Mar 04, 2023 at 04:39:02PM +0000, Matthew Wilcox wrote:
>> I'm getting more and more
>> comfortable with the idea that "Linux doesn't support block sizes >
>> PAGE_SIZE on 32-bit machines" is an acceptable answer.
> 
> First of all, filesystems would need to add support for block
> sizes > PAGE_SIZE, and that takes effort. It is also a support question.
> 
> I think garnering consensus from filesystem developers that we don't want
> to support block sizes > PAGE_SIZE on 32-bit systems would be a good
> thing to review at LSFMM or even on this list. I highly doubt anyone
> is interested in that support.
> 
>> XFS already works with arbitrary-order folios.
> 
> But block sizes > PAGE_SIZE is work which is still not merged. It
> *can* be with time. That would allow one to muck with larger block
> sizes than 4k on x86-64 for instance. Without this, you can't play
> ball.
> 
>> The only needed piece is
>> specifying to the VFS that there's a minimum order for this particular
>> inode, and having the VFS honour that everywhere.
> 
> Other than the above too, don't we still also need to figure out what
> fs APIs would incur larger order folios? And then what about corner cases
> with the page cache?
> 
> I was hoping some of these nooks and crannies could be explored with tmpfs.
> 
I have just posted a patchset for 'brd' to linux-block for supporting
arbitrary block sizes, both physical and logical. That should give
us a good starting point for experimenting.
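
Usage would then be something along these lines (illustrative only; rd_nr and
rd_size are existing brd module parameters, while rd_blksize is a placeholder
name for whatever the posted patches actually call it):

# Hypothetical sketch: one 1GiB ram disk with a 16k block size, then check
# what the request queue reports (rd_blksize is a placeholder name).
modprobe brd rd_nr=1 rd_size=1048576 rd_blksize=16384
cat /sys/block/ram0/queue/logical_block_size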

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-06  8:23                 ` Matthew Wilcox
  2023-03-06 10:05                   ` Hannes Reinecke
@ 2023-03-06 16:12                   ` Theodore Ts'o
  2023-03-08 17:53                     ` Matthew Wilcox
  1 sibling, 1 reply; 67+ messages in thread
From: Theodore Ts'o @ 2023-03-06 16:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hannes Reinecke, Luis Chamberlain, Keith Busch, Pankaj Raghav,
	Daniel Gomez, Javier González, lsf-pc, linux-fsdevel,
	linux-mm, linux-block

On Mon, Mar 06, 2023 at 08:23:00AM +0000, Matthew Wilcox wrote:
> 
> All current filesystems that I'm aware of require their fs block size
> to be >= LBA size.  That is, you can't take a 512-byte blocksize ext2
> filesystem and put it on a 4kB LBA storage device.
> 
> That means that files can only grow/shrink in 256MB increments.  I
> don't think that amount of wasted space is going to be acceptable.
> So if we're serious about going down this path, we need to tell
> filesystem people to start working out how to support fs block
> size < LBA size.
> 
> That's a big ask, so let's be sure storage vendors actually want
> this.  Both supporting zoned devices & supporting 16k/64k block
> sizes are easier asks.

What HDD vendors want is to be able to have 32k or even 64k *physical*
sector sizes.  This allows for much more efficient erasure codes, so
it will increase their byte capacity now that it's no longer easier to
get capacity boosts by squeezing the tracks closer and closer, and
their have been various engineering tradeoffs with SMR, HAMR, and
MAMR.  HDD vendors have been asking for this at LSF/MM, and in other
venues for ***years***.

This doesn't necessarily mean that the *logical* sector size needs to
be larger.  What I could imagine HDD vendors doing is creating drives
with, say, a 4k logical block size and a 32k physical sector size.  This
means that 4k random writes will require read/modify/write cycles, which
isn't great from a performance perspective.  However, for those
customers who are using raw block
devices for their cluster file system, and for those customers who are
willing to, say, use ext4 with a 4k block size and a 32k cluster size
(using the bigalloc feature), all of the data blocks would be 32k
aligned, and this would work without any modifications.
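
Concretely, that is just something like the following mke2fs invocation
(device name illustrative):

# ext4 with 4k blocks but 32k clusters via bigalloc: allocations happen in
# 32k-sized, 32k-aligned clusters, so data IO lines up with a 32k physical
# sector.
mkfs.ext4 -b 4096 -O bigalloc -C 32768 /dev/sdX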

I suspect that if these drives were made available, this would allow
for a gradual transition to support larger block sizes.  The file
system level changes aren't *that* hard.  There is a chicken and egg
situation here; until these drives are generally available, the
incentive to do the work is minimal.  But with a 4k logical, 32k or
64k physical sector size, we can gradually improve our support for
these file systems with block size > page size, with cluster size >
page size being an intermediate step that would work today.

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-05  5:02               ` Matthew Wilcox
@ 2023-03-08  6:11                 ` Luis Chamberlain
  2023-03-08  7:59                   ` Dave Chinner
  0 siblings, 1 reply; 67+ messages in thread
From: Luis Chamberlain @ 2023-03-08  6:11 UTC (permalink / raw)
  To: Matthew Wilcox, Darrick J. Wong, Dave Chinner
  Cc: James Bottomley, Keith Busch, Theodore Ts'o, Pankaj Raghav,
	Daniel Gomez, lsf-pc, linux-fsdevel, linux-mm, linux-block

On Sun, Mar 05, 2023 at 05:02:43AM +0000, Matthew Wilcox wrote:
> On Sat, Mar 04, 2023 at 08:15:50PM -0800, Luis Chamberlain wrote:
> > On Sat, Mar 04, 2023 at 04:39:02PM +0000, Matthew Wilcox wrote:
> > > XFS already works with arbitrary-order folios. 
> > 
> > But block sizes > PAGE_SIZE is work which is still not merged. It
> > *can* be with time. That would allow one to muck with larger block
> > sizes than 4k on x86-64 for instance. Without this, you can't play
> > ball.
> 
> Do you mean that XFS is checking that fs block size <= PAGE_SIZE and
> that check needs to be dropped?  If so, I don't see where that happens.

None of that. Back in 2018 Chinner had prototyped XFS support with
larger block size > PAGE_SIZE:

https://lwn.net/ml/linux-fsdevel/20181107063127.3902-1-david@fromorbit.com/

I just did a quick attempt to rebase it and most of the leftover work
is actually in IOMAP for writeback and zeroing / writes requiring a new
zero-around functionality. All bugs in the rebase are my own; it is only
compile tested so far, and I'm not happy with some of the changes I had
to make, so it could likely use tons more love:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230307-larger-bs-then-ps-xfs

But it should give you an idea of what type of things filesystems need to do.

And so, each fs would need to decide if it wants to support this sort
of work. It is important from a support perspective, since otherwise it's
hard to procure systems with a PAGE_SIZE larger than 4k.

  Luis

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-08  6:11                 ` Luis Chamberlain
@ 2023-03-08  7:59                   ` Dave Chinner
  0 siblings, 0 replies; 67+ messages in thread
From: Dave Chinner @ 2023-03-08  7:59 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Matthew Wilcox, Darrick J. Wong, James Bottomley, Keith Busch,
	Theodore Ts'o, Pankaj Raghav, Daniel Gomez, lsf-pc,
	linux-fsdevel, linux-mm, linux-block

On Tue, Mar 07, 2023 at 10:11:43PM -0800, Luis Chamberlain wrote:
> On Sun, Mar 05, 2023 at 05:02:43AM +0000, Matthew Wilcox wrote:
> > On Sat, Mar 04, 2023 at 08:15:50PM -0800, Luis Chamberlain wrote:
> > > On Sat, Mar 04, 2023 at 04:39:02PM +0000, Matthew Wilcox wrote:
> > > > XFS already works with arbitrary-order folios. 
> > > 
> > > But block sizes > PAGE_SIZE is work which is still not merged. It
> > > *can* be with time. That would allow one to muck with larger block
> > > sizes than 4k on x86-64 for instance. Without this, you can't play
> > > ball.
> > 
> > Do you mean that XFS is checking that fs block size <= PAGE_SIZE and
> > that check needs to be dropped?  If so, I don't see where that happens.
> 
> None of that. Back in 2018 Chinner had prototyped XFS support with
> larger block size > PAGE_SIZE:
> 
> https://lwn.net/ml/linux-fsdevel/20181107063127.3902-1-david@fromorbit.com/

Having a working BS > PS implementation on XFS based on variable
page order support in the page cache goes back over a
decade before that.

Christoph Lameter did the page cache work, and I added support for
XFS back in 2007. THe total change to XFS required can be seen in
this simple patch:

https://lore.kernel.org/linux-mm/20070423093152.GI32602149@melbourne.sgi.com/

That was when the howls of anguish about high order allocations
Willy mentioned started....

> I just did a quick attempt to rebase it and most of the leftover work
> is actually in IOMAP for writeback and zeroing / writes requiring a new
> zero-around functionality. All bugs in the rebase are my own; it is only
> compile tested so far, and I'm not happy with some of the changes I had
> to make, so it could likely use tons more love:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230307-larger-bs-then-ps-xfs

On a current kernel, that patchset is fundamentally broken as we
have multi-page folio support in XFS and iomap - the patchset is
inherently PAGE_SIZE based and it will do the wrong thing with
PAGE_SIZE based zero-around.

IOWs, IOMAP_F_ZERO_AROUND does not need to exist any more, nor
should any of the custom hooks it triggered in different operations
for zero-around.  That's because we should now be using the same
approach to BS > PS as we first used back in 2007. We already
support multi-page folios in the page cache, so all the zero-around
and partial folio uptodate tracking we need is already in place.

Hence, like Willy said, all we need to do is have
filemap_get_folio(FGP_CREAT) always allocate at least filesystem
block sized and aligned folio and insert them into the mapping tree.
Multi-page folios will always need to be sized as an integer
multiple of the filesystem block size, but once we ensure size and
alignment of folios in the page cache, we get everything else for
free.

/me cues the howls of anguish over memory fragmentation....

> But it should give you an idea of what type of things filesystems need to do.

Not really. It gives you an idea of what filesystems needed to do 5
years ago to support BS > PS. We're living in the age of folios now,
not pages.  Willy starting work on folios was why I dropped that
patch set, firstly because it was going to make the iomap conversion
to folios harder, and secondly, we realised that none of it was
necessary if folios supported multi-page constructs in the page
cache natively.

IOWs, multi-page folios in the page cache should make BS > PS mostly
trivial to support for any filesystem or block device that doesn't
have some other dependency on PAGE_SIZE objects in the page cache
(e.g. bufferheads).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-06 16:12                   ` Theodore Ts'o
@ 2023-03-08 17:53                     ` Matthew Wilcox
  2023-03-08 18:13                       ` James Bottomley
  0 siblings, 1 reply; 67+ messages in thread
From: Matthew Wilcox @ 2023-03-08 17:53 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Hannes Reinecke, Luis Chamberlain, Keith Busch, Pankaj Raghav,
	Daniel Gomez, Javier González, lsf-pc, linux-fsdevel,
	linux-mm, linux-block

On Mon, Mar 06, 2023 at 11:12:14AM -0500, Theodore Ts'o wrote:
> What HDD vendors want is to be able to have 32k or even 64k *physical*
> sector sizes.  This allows for much more efficient erasure codes, so
> it will increase their byte capacity now that it's no longer easier to
> get capacity boosts by squeezing the tracks closer and closer, and
> their have been various engineering tradeoffs with SMR, HAMR, and
> MAMR.  HDD vendors have been asking for this at LSF/MM, and in other
> venues for ***years***.

I've been reminded by a friend who works on the drive side that a
motivation for the SSD vendors is (essentially) the size of sector_t.
Once the drive needs to support more than 2/4 billion sectors, they
need to move to a 64-bit sector size, so the amount of memory consumed
by the FTL doubles, the CPU data cache becomes half as effective, etc.
That significantly increases the BOM for the drive, and so they have
to charge more.  With a 512-byte LBA, that's 2TB; with a 4096-byte LBA,
it's at 16TB and with a 64k LBA, they can keep using 32-bit LBA numbers
all the way up to 256TB.
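
(That is just 2^32 addressable blocks times the LBA size:)

$ for lba in 512 4096 65536; do echo "$lba: $(( (1 << 32) * lba >> 40 )) TiB"; done
512: 2 TiB
4096: 16 TiB
65536: 256 TiB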

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-08 17:53                     ` Matthew Wilcox
@ 2023-03-08 18:13                       ` James Bottomley
  2023-03-09  8:04                         ` Javier González
  0 siblings, 1 reply; 67+ messages in thread
From: James Bottomley @ 2023-03-08 18:13 UTC (permalink / raw)
  To: Matthew Wilcox, Theodore Ts'o
  Cc: Hannes Reinecke, Luis Chamberlain, Keith Busch, Pankaj Raghav,
	Daniel Gomez, Javier González, lsf-pc, linux-fsdevel,
	linux-mm, linux-block

On Wed, 2023-03-08 at 17:53 +0000, Matthew Wilcox wrote:
> On Mon, Mar 06, 2023 at 11:12:14AM -0500, Theodore Ts'o wrote:
> > What HDD vendors want is to be able to have 32k or even 64k
> > *physical* sector sizes.  This allows for much more efficient
> > erasure codes, so it will increase their byte capacity now that
> > it's no longer easier to get capacity boosts by squeezing the
> > tracks closer and closer, and their have been various engineering
> > tradeoffs with SMR, HAMR, and MAMR.  HDD vendors have been asking
> > for this at LSF/MM, and in othervenues for ***years***.
> 
> I've been reminded by a friend who works on the drive side that a
> motivation for the SSD vendors is (essentially) the size of sector_t.
> Once the drive needs to support more than 2/4 billion sectors, they
> need to move to a 64-bit sector size, so the amount of memory
> consumed by the FTL doubles, the CPU data cache becomes half as
> effective, etc. That significantly increases the BOM for the drive,
> and so they have to charge more.  With a 512-byte LBA, that's 2TB;
> with a 4096-byte LBA, it's at 16TB and with a 64k LBA, they can keep
> using 32-bit LBA numbers all the way up to 256TB.

I thought the FTL operated on physical sectors and the logical to
physical was done as a RMW through the FTL?  In which case sector_t
shouldn't matter to the SSD vendors for FTL management because they can
keep the logical sector size while increasing the physical one. 
Obviously if physical size goes above the FS block size, the drives
will behave suboptimally with RMWs, which is why 4k physical is the max
currently.

James


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-05 11:22               ` Hannes Reinecke
  2023-03-06  8:23                 ` Matthew Wilcox
@ 2023-03-08 19:35                 ` Luis Chamberlain
  2023-03-08 19:55                 ` Bart Van Assche
  2 siblings, 0 replies; 67+ messages in thread
From: Luis Chamberlain @ 2023-03-08 19:35 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Matthew Wilcox, Keith Busch, Theodore Ts'o, Pankaj Raghav,
	Daniel Gomez, Javier González, lsf-pc, linux-fsdevel,
	linux-mm, linux-block

On Sun, Mar 05, 2023 at 12:22:15PM +0100, Hannes Reinecke wrote:
> One can view zones as really large LBAs.
> 
> Indeed it might be suboptimal from the OS point of view.
> But from the device point of view it won't.
> And, in fact, with devices becoming faster and faster the question is
> whether sticking with relatively small sectors won't become a limiting
> factor eventually.
> 
> My point being that zones are just there because the I/O stack can only deal
> with sectors up to 4k. If the I/O stack would be capable of dealing
> with larger LBAs one could identify a zone with an LBA, and the entire issue
> of append-only and sequential writes would be moot.
> Even the entire concept of zones becomes irrelevant as the OS would
> trivially only write entire zones.
> 
> What I was saying is that 256M is not set in stone. It's just a compromise
> vendors used. Even if in the course of development we arrive
> at a lower number of max LBA we can handle (say, 2MB) I am pretty
> sure vendors will be quite interested in that.

So I'm re-reading this again and I see what you're suggesting now, Hannes.

You are not suggesting that the reason why we may want larger block
sizes is zone storage support.  Rather, you are suggesting that *if* we
support larger block sizes, they effectively could be used as a
replacement for smaller zone sizes.  Your comments about 256 MiB zones
are just a target max assumption for existing known zones.

So in that sense, you seem to suggest that users of smaller zone sizes
could potentially look at using larger block sizes instead, as there
would be no other new "feature" needed beyond the existing efforts to
ensure higher-order folio support is in place and / or buffer heads are
addressed.

But this misses the gains of zone storage on the FTL. The strong semantics
of sequential writes and a write pointer differ from how an existing storage
controller may deal with writing to *one* block. You are not forbidden to
just modify a bit in non-zoned storage; behind the scenes the FTL would do
whatever it thinks it has to, very likely a read-modify-write, and it may
just splash the write into one fresh block for you, so the write appears to
happen in a flash but in reality it used a bit of the over-provisioning
blocks. But with zone storage you get a considerable reduction in
over-provisioning, which we don't get with simple larger block size support
for non-zoned drives.

  Luis

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-05 11:22               ` Hannes Reinecke
  2023-03-06  8:23                 ` Matthew Wilcox
  2023-03-08 19:35                 ` Luis Chamberlain
@ 2023-03-08 19:55                 ` Bart Van Assche
  2 siblings, 0 replies; 67+ messages in thread
From: Bart Van Assche @ 2023-03-08 19:55 UTC (permalink / raw)
  To: Hannes Reinecke, Matthew Wilcox
  Cc: Luis Chamberlain, Keith Busch, Theodore Ts'o, Pankaj Raghav,
	Daniel Gomez, Javier González, lsf-pc, linux-fsdevel,
	linux-mm, linux-block

On 3/5/23 03:22, Hannes Reinecke wrote:
> My point being that zones are just there because the I/O stack can only 
> deal with sectors up to 4k. If the I/O stack would be capable of dealing
> with larger LBAs one could identify a zone with an LBA, and the entire 
> issue of append-only and sequential writes would be moot.
> Even the entire concept of zones becomes irrelevant as the OS would 
> trivially only write entire zones.

That's not correct. Even if the block layer core supported logical 
block sizes of 1 GiB or higher, a logical block size of 16 KiB will 
yield better performance than logical block size = zone size. The write 
amplification factor (WAF) would be huge for databases if the logical 
block size were much larger than the typical amount of data written 
during a database update (16 KiB?).

Bart.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-08 18:13                       ` James Bottomley
@ 2023-03-09  8:04                         ` Javier González
  2023-03-09 13:11                           ` James Bottomley
  0 siblings, 1 reply; 67+ messages in thread
From: Javier González @ 2023-03-09  8:04 UTC (permalink / raw)
  To: James Bottomley
  Cc: Matthew Wilcox, Theodore Ts'o, Hannes Reinecke,
	Luis Chamberlain, Keith Busch, Pankaj Raghav, Daniel Gomez,
	lsf-pc, linux-fsdevel, linux-mm, linux-block

On 08.03.2023 13:13, James Bottomley wrote:
>On Wed, 2023-03-08 at 17:53 +0000, Matthew Wilcox wrote:
>> On Mon, Mar 06, 2023 at 11:12:14AM -0500, Theodore Ts'o wrote:
>> > What HDD vendors want is to be able to have 32k or even 64k
>> > *physical* sector sizes.  This allows for much more efficient
>> > erasure codes, so it will increase their byte capacity now that
>> > it's no longer easier to get capacity boosts by squeezing the
>> > tracks closer and closer, and their have been various engineering
>> > tradeoffs with SMR, HAMR, and MAMR.  HDD vendors have been asking
>> > for this at LSF/MM, and in othervenues for ***years***.
>>
>> I've been reminded by a friend who works on the drive side that a
>> motivation for the SSD vendors is (essentially) the size of sector_t.
>> Once the drive needs to support more than 2/4 billion sectors, they
>> need to move to a 64-bit sector size, so the amount of memory
>> consumed by the FTL doubles, the CPU data cache becomes half as
>> effective, etc. That significantly increases the BOM for the drive,
>> and so they have to charge more.  With a 512-byte LBA, that's 2TB;
>> with a 4096-byte LBA, it's at 16TB and with a 64k LBA, they can keep
>> using 32-bit LBA numbers all the way up to 256TB.
>
>I thought the FTL operated on physical sectors and the logical to
>physical was done as a RMW through the FTL?  In which case sector_t
>shouldn't matter to the SSD vendors for FTL management because they can
>keep the logical sector size while increasing the physical one.
>Obviously if physical size goes above the FS block size, the drives
>will behave suboptimally with RMWs, which is why 4k physical is the max
>currently.
>

FTL designs are complex. We have ways to maintain sector sizes under 64
bits, but this is a common industry problem.

The media itself does not normally operate at 4K. Page sizes can be 16K,
32K, etc. Increasing the block size would allow for better host/device
cooperation. As Ted mentions, this has been a requirement for HDD and
SSD vendors for years. It seems to us that the time is right now and that
we have the mechanisms in Linux to do the plumbing. Folios are obviously a
big part of this.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-09  8:04                         ` Javier González
@ 2023-03-09 13:11                           ` James Bottomley
  2023-03-09 14:05                             ` Keith Busch
                                               ` (2 more replies)
  0 siblings, 3 replies; 67+ messages in thread
From: James Bottomley @ 2023-03-09 13:11 UTC (permalink / raw)
  To: Javier González
  Cc: Matthew Wilcox, Theodore Ts'o, Hannes Reinecke,
	Luis Chamberlain, Keith Busch, Pankaj Raghav, Daniel Gomez,
	lsf-pc, linux-fsdevel, linux-mm, linux-block

On Thu, 2023-03-09 at 09:04 +0100, Javier González wrote:
> On 08.03.2023 13:13, James Bottomley wrote:
> > On Wed, 2023-03-08 at 17:53 +0000, Matthew Wilcox wrote:
> > > On Mon, Mar 06, 2023 at 11:12:14AM -0500, Theodore Ts'o wrote:
> > > > What HDD vendors want is to be able to have 32k or even 64k
> > > > *physical* sector sizes.  This allows for much more efficient
> > > > erasure codes, so it will increase their byte capacity now that
> > > > it's no longer easier to get capacity boosts by squeezing the
> > > > tracks closer and closer, and their have been various
> > > > engineering tradeoffs with SMR, HAMR, and MAMR.  HDD vendors
> > > > have been asking for this at LSF/MM, and in othervenues for
> > > > ***years***.
> > > 
> > > I've been reminded by a friend who works on the drive side that a
> > > motivation for the SSD vendors is (essentially) the size of
> > > sector_t. Once the drive needs to support more than 2/4 billion
> > > sectors, they need to move to a 64-bit sector size, so the amount
> > > of memory consumed by the FTL doubles, the CPU data cache becomes
> > > half as effective, etc. That significantly increases the BOM for
> > > the drive, and so they have to charge more.  With a 512-byte LBA,
> > > that's 2TB; with a 4096-byte LBA, it's at 16TB and with a 64k
> > > LBA, they can keep using 32-bit LBA numbers all the way up to
> > > 256TB.
> > 
> > I thought the FTL operated on physical sectors and the logical to
> > physical was done as a RMW through the FTL?  In which case sector_t
> > shouldn't matter to the SSD vendors for FTL management because they
> > can keep the logical sector size while increasing the physical one.
> > Obviously if physical size goes above the FS block size, the drives
> > will behave suboptimally with RMWs, which is why 4k physical is the
> > max currently.
> > 
> 
> FTL designs are complex. We have ways to maintain sector sizes under
> 64 bits, but this is a common industry problem.
> 
> The media itself does not normally operate at 4K. Page sizes can be
> 16K, 32K, etc.

Right, and we've always said if we knew what this size was we could
make better block write decisions.  However, today if you look what
most NVMe devices are reporting, it's a bit sub-optimal:

jejb@lingrow:/sys/block/nvme1n1/queue> cat logical_block_size 
512
jejb@lingrow:/sys/block/nvme1n1/queue> cat physical_block_size 
512
jejb@lingrow:/sys/block/nvme1n1/queue> cat optimal_io_size 
0

If we do get Linux to support large block sizes, are we actually going
to get better information out of the devices?

>  Increasing the block size would allow for better host/device
> cooperation. As Ted mentions, this has been a requirement for HDD and
> SSD vendors for years. It seems to us that the time is right now and
> that we have the mechanisms in Linux to do the plumbing. Folios are
> obviously a big part of this.

Well a decade ago we did a lot of work to support 4k sector devices.
Ultimately the industry went with 512 logical/4k physical devices
because of problems with non-Linux proprietary OSs but you could still
use 4k today if you wanted (I've actually still got a working 4k SCSI
drive), so why is no NVMe device doing that?

This is not to say I think larger block sizes is in any way a bad idea
... I just think that given the history, it will be driven by
application needs rather than what the manufacturers tell us.

James


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-09 13:11                           ` James Bottomley
@ 2023-03-09 14:05                             ` Keith Busch
  2023-03-09 15:23                             ` Martin K. Petersen
  2023-03-10  7:59                             ` Javier González
  2 siblings, 0 replies; 67+ messages in thread
From: Keith Busch @ 2023-03-09 14:05 UTC (permalink / raw)
  To: James Bottomley
  Cc: Javier González, Matthew Wilcox, Theodore Ts'o,
	Hannes Reinecke, Luis Chamberlain, Pankaj Raghav, Daniel Gomez,
	lsf-pc, linux-fsdevel, linux-mm, linux-block

On Thu, Mar 09, 2023 at 08:11:35AM -0500, James Bottomley wrote:
> On Thu, 2023-03-09 at 09:04 +0100, Javier González wrote:
> > FTL designs are complex. We have ways to maintain sector sizes under
> > 64 bits, but this is a common industry problem.
> > 
> > The media itself does not normally oeprate at 4K. Page siges can be
> > 16K, 32K, etc.
> 
> Right, and we've always said if we knew what this size was we could
> make better block write decisions.  However, today if you look what
> most NVMe devices are reporting, it's a bit sub-optimal:

Your sample size may be off if your impression is that "most" NVMe drives
report themselves this way. :)
 
> jejb@lingrow:/sys/block/nvme1n1/queue> cat logical_block_size 
> 512
> jejb@lingrow:/sys/block/nvme1n1/queue> cat physical_block_size 
> 512
> jejb@lingrow:/sys/block/nvme1n1/queue> cat optimal_io_size 
> 0
> 
> If we do get Linux to support large block sizes, are we actually going
> to get better information out of the devices?
> 
> >  Increasing the block size would allow for better host/device
> > cooperation. As Ted mentions, this has been a requirement for HDD and
> > SSD vendor for years. It seems to us that the time is right now and
> > that we have mechanisms in Linux to do the plumbing. Folios is
> > ovbiously a big part of this.
> 
> Well a decade ago we did a lot of work to support 4k sector devices.
> Ultimately the industry went with 512 logical/4k physical devices
> because of problems with non-Linux proprietary OSs but you could still
> use 4k today if you wanted (I've actually still got a working 4k SCSI
> drive), so why is no NVMe device doing that?

In my experience, all but the cheapest consumer grade nvme devices report 4k
logical. They all support an option to emulate 512b if you really want it,
but the more optimal 4k is the most common default for server grade nvme.
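
Both can be inspected and switched with nvme-cli (the format index below is
illustrative and device specific; note that a format operation erases the
namespace's data):

# List the namespace's supported LBA formats, then reformat to the 4k one.
nvme id-ns /dev/nvme0n1 --human-readable
nvme format /dev/nvme0n1 --lbaf=1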

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-09 13:11                           ` James Bottomley
  2023-03-09 14:05                             ` Keith Busch
@ 2023-03-09 15:23                             ` Martin K. Petersen
  2023-03-09 20:49                               ` James Bottomley
  2023-03-10  7:59                             ` Javier González
  2 siblings, 1 reply; 67+ messages in thread
From: Martin K. Petersen @ 2023-03-09 15:23 UTC (permalink / raw)
  To: James Bottomley
  Cc: Javier González, Matthew Wilcox, Theodore Ts'o,
	Hannes Reinecke, Luis Chamberlain, Keith Busch, Pankaj Raghav,
	Daniel Gomez, lsf-pc, linux-fsdevel, linux-mm, linux-block


James,

> Well a decade ago we did a lot of work to support 4k sector devices.
> Ultimately the industry went with 512 logical/4k physical devices
> because of problems with non-Linux proprietary OSs but you could still
> use 4k today if you wanted (I've actually still got a working 4k SCSI
> drive), so why is no NVMe device doing that?

FWIW, I have SATA, SAS, and NVMe devices that report 4KB logical.

The reason the industry converged on 512e is that the performance
problems were solved by ensuring correct alignment and transfer length.

Almost every I/O we submit is a multiple of 4KB. So if things are
properly aligned wrt. the device's physical block size, it is irrelevant
whether we express CDB fields in units of 512 bytes or 4KB. We're still
transferring the same number of bytes.

In addition 512e had two additional advantages that 4Kn didn't:

1. Legacy applications doing direct I/O and expecting 512-byte blocks
   kept working (albeit with a penalty for writes smaller than a
   physical block).

2. For things like PI where the 16-bit CRC is underwhelming wrt.
   detecting errors in 4096 bytes of data, leaving the protection
   interval at 512 bytes was also a benefit. So while 4Kn adoption
   looked strong inside enterprise disk arrays initially, several
   vendors ended up with 512e for PI reasons.

Once I/Os from the OS were properly aligned, there was just no
compelling reason for anyone to go with 4Kn and having to deal with
multiple SKUs, etc.

For NVMe 4Kn was prevalent for a while but drives have started
gravitating towards 512n/512e. Perhaps because of (1) above. Plus
whatever problems there may be on other platforms as you mentioned...

> This is not to say I think larger block sizes is in any way a bad idea
> ... I just think that given the history, it will be driven by
> application needs rather than what the manufacturers tell us.

I think it would be beneficial for Linux to support filesystem blocks
larger than the page size. Based on experience outlined above, I am not
convinced larger logical block sizes will get much traction. But that
doesn't prevent devices from advertising larger physical/minimum/optimal
I/O sizes and for us to handle those more gracefully than we currently
do.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-09 15:23                             ` Martin K. Petersen
@ 2023-03-09 20:49                               ` James Bottomley
  2023-03-09 21:13                                 ` Luis Chamberlain
  0 siblings, 1 reply; 67+ messages in thread
From: James Bottomley @ 2023-03-09 20:49 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Javier González, Matthew Wilcox, Theodore Ts'o,
	Hannes Reinecke, Luis Chamberlain, Keith Busch, Pankaj Raghav,
	Daniel Gomez, lsf-pc, linux-fsdevel, linux-mm, linux-block

On Thu, 2023-03-09 at 10:23 -0500, Martin K. Petersen wrote:
> > This is not to say I think larger block sizes is in any way a bad
> > idea ... I just think that given the history, it will be driven by
> > application needs rather than what the manufacturers tell us.
> 
> I think it would be beneficial for Linux to support filesystem blocks
> larger than the page size. Based on experience outlined above, I am
> not convinced larger logical block sizes will get much traction. But
> that doesn't prevent devices from advertising larger
> physical/minimum/optimal I/O sizes and for us to handle those more
> gracefully than we currently do.

Right, I was wondering if we could try to persuade the manufacturers to
advertise a more meaningful optimal I/O size ...  But as you say, the
pressure is coming from applications and filesystems for larger block
sizes and that will create I/O patterns that are more beneficial to the
underlying device hardware regardless of whether it actually tells us
anything about what it would like.

James


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-09 20:49                               ` James Bottomley
@ 2023-03-09 21:13                                 ` Luis Chamberlain
  2023-03-09 21:28                                   ` Martin K. Petersen
  0 siblings, 1 reply; 67+ messages in thread
From: Luis Chamberlain @ 2023-03-09 21:13 UTC (permalink / raw)
  To: James Bottomley, Dan Helmick
  Cc: Martin K. Petersen, Javier González, Matthew Wilcox,
	Theodore Ts'o, Hannes Reinecke, Keith Busch, Pankaj Raghav,
	Daniel Gomez, lsf-pc, linux-fsdevel, linux-mm, linux-block

On Thu, Mar 09, 2023 at 03:49:50PM -0500, James Bottomley wrote:
> On Thu, 2023-03-09 at 10:23 -0500, Martin K. Petersen wrote:
> > > This is not to say I think larger block sizes is in any way a bad
> > > idea ... I just think that given the history, it will be driven by
> > > application needs rather than what the manufacturers tell us.
> > 
> > I think it would be beneficial for Linux to support filesystem blocks
> > larger than the page size. Based on experience outlined above, I am
> > not convinced larger logical block sizes will get much traction. But
> > that doesn't prevent devices from advertising larger
> > physical/minimum/optimal I/O sizes and for us to handle those more
> > gracefully than we currently do.
> 
> Right, I was wondering if we could try to persuade the Manufacturers to
> advertise a more meaningful optimal I/O size ...

Advocacy for using meaningful values is a real thing; Dan Helmick talked
about this at the last SDC in 2022, at least for NVMe:

https://www.youtube.com/watch?v=3_M92RlVgIQ&ab_channel=SNIAVideo

A big future question is of course how / when to use these for filesystems.
Should there be, for instance, a 'mkfs --optimal-bs' or something which
may look at whatever hints the media provides? Or do we just leave the magic
incantations to the admins?

  Luis

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-09 21:13                                 ` Luis Chamberlain
@ 2023-03-09 21:28                                   ` Martin K. Petersen
  2023-03-10  1:16                                     ` Dan Helmick
  0 siblings, 1 reply; 67+ messages in thread
From: Martin K. Petersen @ 2023-03-09 21:28 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: James Bottomley, Dan Helmick, Martin K. Petersen,
	Javier González, Matthew Wilcox, Theodore Ts'o,
	Hannes Reinecke, Keith Busch, Pankaj Raghav, Daniel Gomez,
	lsf-pc, linux-fsdevel, linux-mm, linux-block


Luis,

> A big future question is of course how / when to use these for
> filesystems.  Should there be, for instance, a 'mkfs --optimal-bs' or
> something which may look at whatever hints the media provides? Or do we
> just leave the magic incantations to the admins?

mkfs already considers the reported queue limits (for the filesystems
most people use, anyway).

The problem is mainly that the devices don't report them. At least not
very often in the NVMe space. For SCSI devices, reporting these
parameters is quite common.
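
For example (values illustrative), mkfs.xfs derives its stripe unit/width
defaults from the reported limits when they are non-zero:

cat /sys/block/sdX/queue/minimum_io_size   # e.g. 65536
cat /sys/block/sdX/queue/optimal_io_size   # e.g. 524288
mkfs.xfs /dev/sdX                          # picks su/sw from the above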

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-09 21:28                                   ` Martin K. Petersen
@ 2023-03-10  1:16                                     ` Dan Helmick
  0 siblings, 0 replies; 67+ messages in thread
From: Dan Helmick @ 2023-03-10  1:16 UTC (permalink / raw)
  To: Martin K. Petersen, Luis Chamberlain
  Cc: James Bottomley, Javier González, Matthew Wilcox,
	Theodore Ts'o, Hannes Reinecke, Keith Busch, Pankaj Raghav,
	Daniel Gomez, lsf-pc, linux-fsdevel, linux-mm, linux-block

> -----Original Message-----
> From: Martin K. Petersen [mailto:martin.petersen@oracle.com]
> Sent: Thursday, March 9, 2023 2:28 PM
> To: Luis Chamberlain <mcgrof@kernel.org>
> Cc: James Bottomley <James.Bottomley@hansenpartnership.com>; Dan
> Helmick <dan.helmick@samsung.com>; Martin K. Petersen
> <martin.petersen@oracle.com>; Javier González
> <javier.gonz@samsung.com>; Matthew Wilcox <willy@infradead.org>;
> Theodore Ts'o <tytso@mit.edu>; Hannes Reinecke <hare@suse.de>; Keith
> Busch <kbusch@kernel.org>; Pankaj Raghav <p.raghav@samsung.com>;
> Daniel Gomez <da.gomez@samsung.com>; lsf-pc@lists.linux-foundation.org;
> linux-fsdevel@vger.kernel.org; linux-mm@kvack.org; linux-
> block@vger.kernel.org
> Subject: Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
> 
> 
> Luis,
> 
> > A big future question is of course how / when to use these for
> > filesystems.  Should there be, for instance a 'mkfs --optimal-bs' or
> > something which may look whatever hints the media uses ? Or do we just
> > leaves the magic incantations to the admins?
> 
> mkfs already considers the reported queue limits (for the filesystems most
> people use, anyway).
> 
> The problem is mainly that the devices don't report them. At least not very
> often in the NVMe space. For SCSI devices, reporting these parameters is
> quite common.
> 
> --
> Martin K. Petersen	Oracle Linux Engineering

Support for the NVMe Optimal Performance parameters is increasing in the vendor ecosystem.  Customers are requiring this more and more from the vendors.  For example, the OCP DC NVMe SSD spec has NVMe-AD-2 and NVMe-OPT-7 [1].  Momentum is continuing as Optimal Read parameters were recently added to NVMe too.  More companies adding these parameters as a requirement to drive vendors would definitely help the momentum further.

I think there has been confusion among the vendors in the past on how to set the various values for the best Host behavior.  There are multiple (sometimes minor) inflection points in the performance of a drive.  Sure, 4KB is too small for the drive to report, but shall we report our 16KB, our 128KB, or some other inflection point?  How big a value can we push this to?  We would always favor the bigger number.

There are benefits for both Host and Drive (HDD and SSD) in having larger IOs.  Even if you have a drive reporting incorrect optimal parameters today, one can incubate the SW changes with larger IOs.  If nothing else, you'll instantly save on the overhead of communicating the higher number of commands.  Further, doing an IO sized as a multiple of the optimal parameters is also optimal.  Enabling anything in the range 16KB - 64KB would likely be a great start.

[1] https://www.opencompute.org/documents/datacenter-nvme-ssd-specification-v2-0r21-pdf


Dan

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-09 13:11                           ` James Bottomley
  2023-03-09 14:05                             ` Keith Busch
  2023-03-09 15:23                             ` Martin K. Petersen
@ 2023-03-10  7:59                             ` Javier González
  2 siblings, 0 replies; 67+ messages in thread
From: Javier González @ 2023-03-10  7:59 UTC (permalink / raw)
  To: James Bottomley
  Cc: Matthew Wilcox, Theodore Ts'o, Hannes Reinecke,
	Luis Chamberlain, Keith Busch, Pankaj Raghav, Daniel Gomez,
	lsf-pc, linux-fsdevel, linux-mm, linux-block

On 09.03.2023 08:11, James Bottomley wrote:
>On Thu, 2023-03-09 at 09:04 +0100, Javier González wrote:
>> On 08.03.2023 13:13, James Bottomley wrote:
>> > On Wed, 2023-03-08 at 17:53 +0000, Matthew Wilcox wrote:
>> > > On Mon, Mar 06, 2023 at 11:12:14AM -0500, Theodore Ts'o wrote:
>> > > > What HDD vendors want is to be able to have 32k or even 64k
>> > > > *physical* sector sizes.  This allows for much more efficient
>> > > > erasure codes, so it will increase their byte capacity now that
>> > > > it's no longer easier to get capacity boosts by squeezing the
>> > > > tracks closer and closer, and there have been various
>> > > > engineering tradeoffs with SMR, HAMR, and MAMR.  HDD vendors
>> > > > have been asking for this at LSF/MM, and in other venues, for
>> > > > ***years***.
>> > >
>> > > I've been reminded by a friend who works on the drive side that a
>> > > motivation for the SSD vendors is (essentially) the size of
>> > > sector_t. Once the drive needs to support more than 2/4 billion
>> > > sectors, they need to move to 64-bit sector numbers, so the amount
>> > > of memory consumed by the FTL doubles, the CPU data cache becomes
>> > > half as effective, etc. That significantly increases the BOM for
>> > > the drive, and so they have to charge more.  With a 512-byte LBA,
>> > > that's 2TB; with a 4096-byte LBA, it's 16TB, and with a 64k
>> > > LBA, they can keep using 32-bit LBA numbers all the way up to
>> > > 256TB.
>> >
>> > I thought the FTL operated on physical sectors and the logical to
>> > physical was done as a RMW through the FTL?  In which case sector_t
>> > shouldn't matter to the SSD vendors for FTL management because they
>> > can keep the logical sector size while increasing the physical one.
>> > Obviously if physical size goes above the FS block size, the drives
>> > will behave suboptimally with RMWs, which is why 4k physical is the
>> > max currently.
>> >
>>
>> FTL designs are complex. We have ways to keep sector addressing under
>> 64 bits, but this is a common industry problem.
>>
>> The media itself does not normally operate at 4K. Page sizes can be
>> 16K, 32K, etc.
>
>Right, and we've always said if we knew what this size was we could
>make better block write decisions.  However, today if you look at what
>most NVMe devices are reporting, it's a bit sub-optimal:
>
>jejb@lingrow:/sys/block/nvme1n1/queue> cat logical_block_size
>512
>jejb@lingrow:/sys/block/nvme1n1/queue> cat physical_block_size
>512
>jejb@lingrow:/sys/block/nvme1n1/queue> cat optimal_io_size
>0
>
>If we do get Linux to support large block sizes, are we actually going
>to get better information out of the devices?

We already have this through the NVMe Optimal Performance parameters
(see Dan's response for this). Note that support for these values is
already plumbed into the kernel. If I recall correctly, Bart was the one
who did this work.
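
For reference, here is a simplified sketch of how namespace fields like
NPWG (Namespace Preferred Write Granularity) and NOWS (Namespace Optimal
Write Size) can be mapped onto the block layer queue limits.  This is a
paraphrase for illustration only, not the actual nvme driver code, and the
helper name is made up:

#include <linux/blkdev.h>
#include <linux/types.h>

/*
 * NPWG and NOWS are 0's based multiples of the logical block size, so a
 * value of 0 means "one logical block".  Map them onto the physical block
 * size and optimal I/O size that the rest of the stack already understands.
 */
static void map_nvme_perf_hints(struct request_queue *q,
				unsigned int lba_size, u16 npwg, u16 nows)
{
	blk_queue_physical_block_size(q, lba_size * (1 + npwg));
	blk_queue_io_opt(q, lba_size * (1 + nows));
}

These are the values that then show up under /sys/block/<dev>/queue/ as
physical_block_size and optimal_io_size.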

Moreover, from the vendor side, it is a challenge to expose larger LBAs
without wide support in OSs. I am confident that if we push for this work
and it fits existing FSs, we will see vendors exposing new LBA formats
alongside the existing ones at first (the same way we have 512b and 4K in
the same drive today), and eventually focusing only on larger LBA sizes.

>
>>  Increasing the block size would allow for better host/device
>> cooperation. As Ted mentions, this has been a requirement for HDD and
>> SSD vendors for years. It seems to us that the time is right now and
>> that we have mechanisms in Linux to do the plumbing. Folios are
>> obviously a big part of this.
>
>Well a decade ago we did a lot of work to support 4k sector devices.
>Ultimately the industry went with 512 logical/4k physical devices
>because of problems with non-Linux proprietary OSs, but you could still
>use 4k today if you wanted (I've actually still got a working 4k SCSI
>drive), so why is no NVMe device doing that?

Most NVMe devices report 4K today. Actually 512b is mostly an
optimization targeted at read-heavy workloads.

>
>This is not to say I think larger block sizes are in any way a bad idea
>... I just think that given the history, it will be driven by
>application needs rather than what the manufacturers tell us.

I see more and more that this deserves a session at LSF/MM.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-03 22:32           ` Keith Busch
  2023-03-03 23:09             ` Luis Chamberlain
@ 2023-03-16 15:29             ` Pankaj Raghav
  2023-03-16 15:41               ` Pankaj Raghav
  1 sibling, 1 reply; 67+ messages in thread
From: Pankaj Raghav @ 2023-03-16 15:29 UTC (permalink / raw)
  To: Keith Busch, Luis Chamberlain
  Cc: Matthew Wilcox, Theodore Ts'o, Daniel Gomez,
	Javier González, lsf-pc, linux-fsdevel, linux-mm,
	linux-block, Dave Chinner, Christoph Hellwig

Hi Keith,

On 2023-03-03 23:32, Keith Busch wrote:
>> Yes, clearly it says *yet* so that begs the question what would be
>> required?
> 
> Oh, gotcha. I'll work on a list of places it currently crashes.
>  
I started looking into this to see why it crashes when we increase the LBA
size of a block device beyond the page size. These are my primary
findings:

- Block device aops (address_space_operations) are all based on buffer
heads, which limits us to working on PAGE_SIZE chunks only.

For an 8k LBA size, the stack trace you posted ultimately fails inside
alloc_page_buffers as the size will be > PAGE_SIZE.

struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
		bool retry)
{
	struct buffer_head *bh, *head;
....
	head = NULL;
	offset = PAGE_SIZE;
	while ((offset -= size) >= 0) {
		// we will not go into this loop as offset will be negative
	...
	...
	}
	return head;
}

- As Dave Chinner pointed out later in the thread, we allocate pages in the
page cache with order 0, instead of the block size of the device or the
filesystem. Letting filemap_get_folio(FGP_CREAT) allocate folios of the LBA
size for a block device should solve that problem, I guess.
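
To illustrate the idea (a conceptual sketch only, not a working patch; the
helper name is made up, and index alignment to the folio order is simply
assumed), allocating a folio that matches the LBA size and inserting it
into the block device's page cache could look roughly like this:

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Allocate a folio large enough to hold one logical block and add it to
 * the page cache, instead of the default order-0 page.  For an 8k LBA on
 * a 4k PAGE_SIZE system, get_order() gives order 1.
 */
static struct folio *bdev_grab_block_folio(struct address_space *mapping,
					   pgoff_t index, unsigned int lba_size)
{
	unsigned int order = get_order(lba_size);
	struct folio *folio;

	folio = folio_alloc(GFP_KERNEL, order);
	if (!folio)
		return NULL;

	if (filemap_add_folio(mapping, folio, index, GFP_KERNEL)) {
		folio_put(folio);
		return NULL;
	}
	return folio;
}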

Is it a crazy idea to convert the block device aops (block/fops.c) to use
iomap, which supports higher-order folios, instead of mpage and the other
buffer-head-based helpers?
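
As a very rough sketch of the shape such a conversion could take (not a
working patch: a block device maps 1:1 from file offset to disk offset, so
the iomap_begin callback is nearly trivial; i_size clamping, the write path
and error handling are all glossed over, and the function names are only
illustrative):

#include <linux/blkdev.h>
#include <linux/iomap.h>

static int bdev_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
			    unsigned int flags, struct iomap *iomap,
			    struct iomap *srcmap)
{
	iomap->bdev = I_BDEV(inode);
	iomap->offset = pos;
	iomap->length = length;
	iomap->addr = pos;		/* 1:1 logical-to-physical mapping */
	iomap->type = IOMAP_MAPPED;
	return 0;
}

static const struct iomap_ops bdev_iomap_ops = {
	.iomap_begin	= bdev_iomap_begin,
};

/* A read_folio aop built on iomap, which handles large folios natively. */
static int bdev_iomap_read_folio(struct file *file, struct folio *folio)
{
	return iomap_read_folio(folio, &bdev_iomap_ops);
}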

Let me know your thoughts.
--
Pankaj

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
  2023-03-16 15:29             ` Pankaj Raghav
@ 2023-03-16 15:41               ` Pankaj Raghav
  0 siblings, 0 replies; 67+ messages in thread
From: Pankaj Raghav @ 2023-03-16 15:41 UTC (permalink / raw)
  To: Keith Busch, Luis Chamberlain
  Cc: Matthew Wilcox, Theodore Ts'o, Daniel Gomez,
	Javier González, lsf-pc, linux-fsdevel, linux-mm,
	linux-block, Dave Chinner, Christoph Hellwig

On 2023-03-16 16:29, Pankaj Raghav wrote:
> Hi Keith,
> 
> On 2023-03-03 23:32, Keith Busch wrote:
>>> Yes, clearly it says *yet* so that begs the question what would be
>>> required?
>>
>> Oh, gotcha. I'll work on a list of places it currently crashes.
>>  
> I started looking into this to see why it crashes when we increase the LBA
> size of a block device beyond the page size. These are my primary
> findings:
> 
> - Block device aops (address_space_operations) are all based on buffer
> heads, which limits us to working on PAGE_SIZE chunks only.
> 
> For an 8k LBA size, the stack trace you posted ultimately fails inside
> alloc_page_buffers as the size will be > PAGE_SIZE.
> 
> struct buffer_head *alloc_page_buffers(struct page *page, unsigned long
> size, bool retry)
> 
Aghh. Sorry for the ugly formatting:

struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
		bool retry)
{
	struct buffer_head *bh, *head;
	gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT;
	long offset;
	struct mem_cgroup *memcg, *old_memcg;

	if (retry)
		gfp |= __GFP_NOFAIL;

	/* The page lock pins the memcg */
	memcg = page_memcg(page);
	old_memcg = set_active_memcg(memcg);

	head = NULL;
	offset = PAGE_SIZE;
	while ((offset -= size) >= 0) {
		// we will not go into this loop as offset will be negative
		...
		...
	}
	...
	return head;	// we return NULL for LBA size > 4k
}

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
@ 2023-07-16  4:09 BELINDA Goodpaster kelly
  0 siblings, 0 replies; 67+ messages in thread
From: BELINDA Goodpaster kelly @ 2023-07-16  4:09 UTC (permalink / raw)
  To: tytso; +Cc: linux-block, linux-fsdevel, linux-mm, lsf-pc

[-- Attachment #1: Type: text/plain, Size: 1 bytes --]



[-- Attachment #2: Type: text/html, Size: 23 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

end of thread, other threads:[~2023-07-18  4:06 UTC | newest]

Thread overview: 67+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-01  3:52 [LSF/MM/BPF TOPIC] Cloud storage optimizations Theodore Ts'o
2023-03-01  4:18 ` Gao Xiang
2023-03-01  4:40   ` Matthew Wilcox
2023-03-01  4:59     ` Gao Xiang
2023-03-01  4:35 ` Matthew Wilcox
2023-03-01  4:49   ` Gao Xiang
2023-03-01  5:01     ` Matthew Wilcox
2023-03-01  5:09       ` Gao Xiang
2023-03-01  5:19         ` Gao Xiang
2023-03-01  5:42         ` Matthew Wilcox
2023-03-01  5:51           ` Gao Xiang
2023-03-01  6:00             ` Gao Xiang
2023-03-02  3:13 ` Chaitanya Kulkarni
2023-03-02  3:50 ` Darrick J. Wong
2023-03-03  3:03   ` Martin K. Petersen
2023-03-02 20:30 ` Bart Van Assche
2023-03-03  3:05   ` Martin K. Petersen
2023-03-03  1:58 ` Keith Busch
2023-03-03  3:49   ` Matthew Wilcox
2023-03-03 11:32     ` Hannes Reinecke
2023-03-03 13:11     ` James Bottomley
2023-03-04  7:34       ` Matthew Wilcox
2023-03-04 13:41         ` James Bottomley
2023-03-04 16:39           ` Matthew Wilcox
2023-03-05  4:15             ` Luis Chamberlain
2023-03-05  5:02               ` Matthew Wilcox
2023-03-08  6:11                 ` Luis Chamberlain
2023-03-08  7:59                   ` Dave Chinner
2023-03-06 12:04               ` Hannes Reinecke
2023-03-06  3:50             ` James Bottomley
2023-03-04 19:04         ` Luis Chamberlain
2023-03-03 21:45     ` Luis Chamberlain
2023-03-03 22:07       ` Keith Busch
2023-03-03 22:14         ` Luis Chamberlain
2023-03-03 22:32           ` Keith Busch
2023-03-03 23:09             ` Luis Chamberlain
2023-03-16 15:29             ` Pankaj Raghav
2023-03-16 15:41               ` Pankaj Raghav
2023-03-03 23:51       ` Bart Van Assche
2023-03-04 11:08       ` Hannes Reinecke
2023-03-04 13:24         ` Javier González
2023-03-04 16:47         ` Matthew Wilcox
2023-03-04 17:17           ` Hannes Reinecke
2023-03-04 17:54             ` Matthew Wilcox
2023-03-04 18:53               ` Luis Chamberlain
2023-03-05  3:06               ` Damien Le Moal
2023-03-05 11:22               ` Hannes Reinecke
2023-03-06  8:23                 ` Matthew Wilcox
2023-03-06 10:05                   ` Hannes Reinecke
2023-03-06 16:12                   ` Theodore Ts'o
2023-03-08 17:53                     ` Matthew Wilcox
2023-03-08 18:13                       ` James Bottomley
2023-03-09  8:04                         ` Javier González
2023-03-09 13:11                           ` James Bottomley
2023-03-09 14:05                             ` Keith Busch
2023-03-09 15:23                             ` Martin K. Petersen
2023-03-09 20:49                               ` James Bottomley
2023-03-09 21:13                                 ` Luis Chamberlain
2023-03-09 21:28                                   ` Martin K. Petersen
2023-03-10  1:16                                     ` Dan Helmick
2023-03-10  7:59                             ` Javier González
2023-03-08 19:35                 ` Luis Chamberlain
2023-03-08 19:55                 ` Bart Van Assche
2023-03-03  2:54 ` Martin K. Petersen
2023-03-03  3:29   ` Keith Busch
2023-03-03  4:20   ` Theodore Ts'o
2023-07-16  4:09 BELINDA Goodpaster kelly
