* LSF/MM/BPF: 2024: Call for Proposals
       [not found] <7970ad75-ca6a-34b9-43ea-c6f67fe6eae6@iogearbox.net>
@ 2023-12-20 10:01 ` Daniel Borkmann
  2023-12-20 15:03   ` [LSF/MM/BPF TOPIC] Large block for I/O Hannes Reinecke
  2024-01-17 13:37   ` LSF/MM/BPF: 2024: Call for Proposals [Reminder] Daniel Borkmann
  0 siblings, 2 replies; 26+ messages in thread
From: Daniel Borkmann @ 2023-12-20 10:01 UTC (permalink / raw)
  To: lsf-pc
  Cc: linux-fsdevel, linux-mm, linux-block, linux-scsi, linux-nvme,
	bpf, netdev, linux-ide, linux-kernel

The annual Linux Storage, Filesystem, Memory Management, and BPF
(LSF/MM/BPF) Summit for 2024 will be held from May 13 to May 15
at the Hilton Salt Lake City Center in Salt Lake City, Utah, USA.

LSF/MM/BPF is an invitation-only technical workshop to map out
improvements to the Linux storage, filesystem, BPF, and memory
management subsystems that will make their way into the mainline
kernel within the coming years.

LSF/MM/BPF 2024 will be a three day, stand-alone conference with
four subsystem-specific tracks, cross-track discussions, as well
as BoF and hacking sessions:

          https://events.linuxfoundation.org/lsfmmbpf/

On behalf of the committee I am issuing a call for agenda proposals
that are suitable for cross-track discussion as well as technical
subjects for the breakout sessions.

If advance notice is required for visa applications then please
point that out in your proposal or request to attend, and submit
the topic as soon as possible.

We are asking that you please let us know you want to be invited
by March 1, 2024. We realize that travel is an ever-changing target,
but it helps us to get an idea of possible attendance numbers.
Clearly things can and will change, so consider the request-to-attend
deadline to be more about planning and less about concrete plans.

1) Fill out the following Google form to request attendance and
suggest any topics for discussion:

          https://forms.gle/TGCgBDH1x5pXiWFo7

In previous years we have accidentally missed people's attendance
requests because they either did not Cc lsf-pc@ or we simply missed
them in the flurry of emails we get. Our community is large and our
volunteers are busy, so filling this out will help us to make sure we
do not miss anybody.

2) Proposals for agenda topics should ideally still be sent to the
following list to allow for discussion among your peers. This will
help us figure out which topics are important for the agenda:

          lsf-pc@lists.linux-foundation.org

... and Cc the mailing lists that are relevant for the topic in
question:

          FS:     linux-fsdevel@vger.kernel.org
          MM:     linux-mm@kvack.org
          Block:  linux-block@vger.kernel.org
          ATA:    linux-ide@vger.kernel.org
          SCSI:   linux-scsi@vger.kernel.org
          NVMe:   linux-nvme@lists.infradead.org
          BPF:    bpf@vger.kernel.org

Please tag your proposal with [LSF/MM/BPF TOPIC] to make it easier
to track. In addition, please make sure to start a new thread for
each topic rather than following up to an existing one. Agenda
topics and attendees will be selected by the program committee,
but the final agenda will be formed by consensus of the attendees
on the day.

3) This year we would also like to try and make sure we are
including new members in the community that the program committee
may not be familiar with. The Google form has an area for people to
add required/optional attendees. Please encourage new members of the
community to submit a request for an invite as well; additionally, if
maintainers or long-term community members could add nominees to the
form, it would help us make sure that new members get proper
consideration.

Discussion leaders are encouraged to use slides and visualizations to
outline the subject matter and focus the discussions. Please refrain
from lengthy presentations and talks so that sessions stay
productive; the sessions are supposed to be interactive, inclusive
discussions.

We are still looking into the virtual component. We will likely run
something similar to what we did last year, but details on that will
be forthcoming.

2023: https://lwn.net/Articles/lsfmmbpf2023/

2022: https://lwn.net/Articles/lsfmm2022/

2019: https://lwn.net/Articles/lsfmm2019/

2018: https://lwn.net/Articles/lsfmm2018/

2017: https://lwn.net/Articles/lsfmm2017/

2016: https://lwn.net/Articles/lsfmm2016/

2015: https://lwn.net/Articles/lsfmm2015/

2014: http://lwn.net/Articles/LSFMM2014/

4) If you have feedback on last year's meeting that we can use to
improve this year's, please also send that to:

          lsf-pc@lists.linux-foundation.org

Thank you on behalf of the program committee:

          Amir Goldstein (Filesystems)
          Jan Kara (Filesystems)
          Martin K. Petersen (Storage)
          Javier González (Storage)
          Michal Hocko (MM)
          Dan Williams (MM)
          Daniel Borkmann (BPF)
          Martin KaFai Lau (BPF)



* [LSF/MM/BPF TOPIC] Large block for I/O
  2023-12-20 10:01 ` LSF/MM/BPF: 2024: Call for Proposals Daniel Borkmann
@ 2023-12-20 15:03   ` Hannes Reinecke
  2023-12-21 20:33     ` Bart Van Assche
  2024-02-23 16:41     ` Pankaj Raghav (Samsung)
  2024-01-17 13:37   ` LSF/MM/BPF: 2024: Call for Proposals [Reminder] Daniel Borkmann
  1 sibling, 2 replies; 26+ messages in thread
From: Hannes Reinecke @ 2023-12-20 15:03 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-mm, linux-block, linux-scsi, linux-nvme

Hi all,

I would like to discuss

Large blocks for I/O

Since the presentation last year there have been quite a few developments
and improvements in some areas, but at the same time a lack of progress
in other areas.
In this presentation/discussion I would like to highlight the current
state of affairs, existing pain points, and future directions of 
development.
It might be an idea to co-locate it with the MM folks as we do have
quite some overlap with page-cache improvements and hugepage handling.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich



* Re: [LSF/MM/BPF TOPIC] Large block for I/O
  2023-12-20 15:03   ` [LSF/MM/BPF TOPIC] Large block for I/O Hannes Reinecke
@ 2023-12-21 20:33     ` Bart Van Assche
  2023-12-21 20:42       ` Matthew Wilcox
                         ` (3 more replies)
  2024-02-23 16:41     ` Pankaj Raghav (Samsung)
  1 sibling, 4 replies; 26+ messages in thread
From: Bart Van Assche @ 2023-12-21 20:33 UTC (permalink / raw)
  To: Hannes Reinecke, lsf-pc; +Cc: linux-mm, linux-block, linux-scsi, linux-nvme

On 12/20/23 07:03, Hannes Reinecke wrote:
> I would like to discuss
> 
> Large blocks for I/O
> 
> Since the presentation last year there have been quite a few developments
> and improvements in some areas, but at the same time a lack of progress
> in other areas.
> In this presentation/discussion I would like to highlight the current
> state of affairs, existing pain points, and future directions of development.
> It might be an idea to co-locate it with the MM folks as we do have
> quite some overlap with page-cache improvements and hugepage handling.

Hi Hannes,

I'm interested in this topic. But I'm wondering whether the disadvantages of
large blocks will be covered? Some NAND storage vendors are less than
enthusiastic about increasing the logical block size beyond 4 KiB because it
increases the size of many writes to the device and hence increases write
amplification.
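
To put a rough number on that concern, here is a minimal sketch of the
host-side arithmetic, assuming a 16 KiB logical block and a fully
contained 4 KiB update (the sizes are illustrative, not measurements
from any device):

#include <stdio.h>

/*
 * Bytes the host must send to the device for one application update of
 * `len` bytes at byte `offset`, when the logical block size is `lbs`:
 * every touched logical block has to be written in full.
 */
static unsigned long long bytes_written(unsigned long long offset,
                                        unsigned long long len,
                                        unsigned long long lbs)
{
    unsigned long long first = offset / lbs;
    unsigned long long last = (offset + len - 1) / lbs;

    return (last - first + 1) * lbs;
}

int main(void)
{
    /* A 4 KiB metadata update at offset 8 KiB. */
    printf("4 KiB update, 4 KiB blocks:  %llu bytes\n",
           bytes_written(8192, 4096, 4096));
    printf("4 KiB update, 16 KiB blocks: %llu bytes\n",
           bytes_written(8192, 4096, 16384));
    return 0;
}

The same 4 KiB update costs 4 KiB with 4 KiB blocks and 16 KiB with
16 KiB blocks at the host interface.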

Thanks,

Bart.



* Re: [LSF/MM/BPF TOPIC] Large block for I/O
  2023-12-21 20:33     ` Bart Van Assche
@ 2023-12-21 20:42       ` Matthew Wilcox
  2023-12-21 21:00         ` Bart Van Assche
  2023-12-22  5:09       ` Christoph Hellwig
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 26+ messages in thread
From: Matthew Wilcox @ 2023-12-21 20:42 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Hannes Reinecke, lsf-pc, linux-mm, linux-block, linux-scsi, linux-nvme

On Thu, Dec 21, 2023 at 12:33:08PM -0800, Bart Van Assche wrote:
> On 12/20/23 07:03, Hannes Reinecke wrote:
> > I would like to discuss
> > 
> > Large blocks for I/O
> > 
> > Since the presentation last year there have been quite a few developments
> > and improvements in some areas, but at the same time a lack of progress
> > in other areas.
> > In this presentation/discussion I would like to highlight the current
> > state of affairs, existing pain points, and future directions of development.
> > It might be an idea to co-locate it with the MM folks as we do have
> > quite some overlap with page-cache improvements and hugepage handling.
> 
> Hi Hannes,
> 
> I'm interested in this topic. But I'm wondering whether the disadvantages of
> large blocks will be covered? Some NAND storage vendors are less than
> enthusiastic about increasing the logical block size beyond 4 KiB because it
> increases the size of many writes to the device and hence increases write
> amplification.

It's LSF/MM.  If this session is being run as a presentation rather than
discussion, it's being done wrongly.  So if you want to talk about the
downsides, show up and talk about them.


* Re: [LSF/MM/BPF TOPIC] Large block for I/O
  2023-12-21 20:42       ` Matthew Wilcox
@ 2023-12-21 21:00         ` Bart Van Assche
  0 siblings, 0 replies; 26+ messages in thread
From: Bart Van Assche @ 2023-12-21 21:00 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hannes Reinecke, lsf-pc, linux-mm, linux-block, linux-scsi, linux-nvme

On 12/21/23 12:42, Matthew Wilcox wrote:
> So if you want to talk about the downsides, show up and talk about them.

If I receive an invitation for the LSF/MM/BPF summit I will show up :-)

Bart.


* Re: [LSF/MM/BPF TOPIC] Large block for I/O
  2023-12-21 20:33     ` Bart Van Assche
  2023-12-21 20:42       ` Matthew Wilcox
@ 2023-12-22  5:09       ` Christoph Hellwig
  2023-12-22  5:13       ` Matthew Wilcox
  2023-12-22  8:23       ` Viacheslav Dubeyko
  3 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2023-12-22  5:09 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Hannes Reinecke, lsf-pc, linux-mm, linux-block, linux-scsi, linux-nvme

On Thu, Dec 21, 2023 at 12:33:08PM -0800, Bart Van Assche wrote:
> I'm interested in this topic. But I'm wondering whether the disadvantages of
> large blocks will be covered? Some NAND storage vendors are less than
> enthusiastic about increasing the logical block size beyond 4 KiB because it
> increases the size of many writes to the device and hence increases write
> amplification.

Then they should not increase the logical block size for the products
where they worry about it.  It's not like larger blocks are a feature
that Linux wants; it's something that makes hardware vendors' lives easier and
is thus pushed by them.  Of course it doesn't make sense for every
product line, but it's not like Linux is going to stop supporting
512 byte or 4k blocks.



* Re: [LSF/MM/BPF TOPIC] Large block for I/O
  2023-12-21 20:33     ` Bart Van Assche
  2023-12-21 20:42       ` Matthew Wilcox
  2023-12-22  5:09       ` Christoph Hellwig
@ 2023-12-22  5:13       ` Matthew Wilcox
  2023-12-22  5:37         ` Christoph Hellwig
  2023-12-22  8:23       ` Viacheslav Dubeyko
  3 siblings, 1 reply; 26+ messages in thread
From: Matthew Wilcox @ 2023-12-22  5:13 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Hannes Reinecke, lsf-pc, linux-mm, linux-block, linux-scsi, linux-nvme

On Thu, Dec 21, 2023 at 12:33:08PM -0800, Bart Van Assche wrote:
> I'm interested in this topic. But I'm wondering whether the disadvantages of
> large blocks will be covered? Some NAND storage vendors are less than
> enthusiastic about increasing the logical block size beyond 4 KiB because it
> increases the size of many writes to the device and hence increases write
> amplification.

I've been mulling this over for a few hours and I don't really understand
it.  The push for larger block sizes is coming from (some) storage
vendors.  If it doesn't make sense for (other) storage vendors, they
don't have to do it.  Just like nobody is forced to ship shingled drives,
or vertical NAND or four-bit-per-cell or fill their drives with helium.
Vendors do it if it makes sense for them, and don't if it doesn't.

It clearly solves a problem (and the one I think it's solving is the
size of the FTL map).  But I can't see why we should stop working on it,
just because not all drive manufacturers want to support it.


* Re: [LSF/MM/BPF TOPIC] Large block for I/O
  2023-12-22  5:13       ` Matthew Wilcox
@ 2023-12-22  5:37         ` Christoph Hellwig
  2024-01-08 19:30           ` Bart Van Assche
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2023-12-22  5:37 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Bart Van Assche, Hannes Reinecke, lsf-pc, linux-mm, linux-block,
	linux-scsi, linux-nvme

On Fri, Dec 22, 2023 at 05:13:43AM +0000, Matthew Wilcox wrote:
> It clearly solves a problem (and the one I think it's solving is the
> size of the FTL map).  But I can't see why we should stop working on it,
> just because not all drive manufacturers want to support it.

I don't think it is drive vendors.  It is the SSD divisions which
all pretty much love it (for certain use cases) vs the UFS/eMMC
divisions which often tend to be fearful and less knowledgeable (to
say it nicely) no matter what vendor you're talking to.



* Re: [LSF/MM/BPF TOPIC] Large block for I/O
  2023-12-21 20:33     ` Bart Van Assche
                         ` (2 preceding siblings ...)
  2023-12-22  5:13       ` Matthew Wilcox
@ 2023-12-22  8:23       ` Viacheslav Dubeyko
  2023-12-22 12:29         ` Hannes Reinecke
  2023-12-22 15:10         ` Keith Busch
  3 siblings, 2 replies; 26+ messages in thread
From: Viacheslav Dubeyko @ 2023-12-22  8:23 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Hannes Reinecke, lsf-pc, linux-mm, linux-block, linux-scsi, linux-nvme



> On Dec 21, 2023, at 11:33 PM, Bart Van Assche <bvanassche@acm.org> wrote:
> 

<skipped>

>> .
> 
> Hi Hannes,
> 
> I'm interested in this topic. But I'm wondering whether the disadvantages of
> large blocks will be covered? Some NAND storage vendors are less than
> enthusiastic about increasing the logical block size beyond 4 KiB because it
> increases the size of many writes to the device and hence increases write
> amplification.
> 

I am also interested in this discussion. Every SSD manufacturer carefully hides the
details of its architecture and FTL behavior. I believe that switching to a bigger
logical block size (like 8KB, 16KB, etc.) could even be better for the SSD's internal
mapping scheme and erase block management. I assume that it could require significant
reworking of the firmware and, potentially, of the ASIC logic, and this could be the
main pain for SSD manufacturers. Frankly speaking, I don't see a direct relation between
increasing the logical block size and increasing write amplification. If you have a 16KB
logical block size on the SSD side and the file system continues to use a 4KB logical
block size, then, yes, I can see the problem. But if the file system manages the space
in 16KB logical blocks and carefully issues I/O requests of the proper size, then
everything should be good. Again, the FTL is simply trying to write logical blocks into
erase blocks. If we have, for example, an 8MB erase block, then mapping and writing
16KB logical blocks looks more beneficial than mapping and writing 4KB logical blocks.

So, I see more trouble on the file system side in supporting a bigger logical block
size. For example, we discussed 8KB folio size support recently. Matthew already shared
a patch for supporting 8KB folios, but everything should be carefully tested. Also, I
ran into an issue with the readahead logic: if I format my file system volume with a
32KB logical block size, the readahead logic hands me 16KB folios, which was slightly
surprising. So, I assume we can find a lot of potential issues on the file system side
with a bigger logical block size, in terms of the efficiency of metadata and user data
operations. Also, heavily loaded systems could have fragmented memory, which could make
memory allocation trickier; it may not be easy to allocate one big folio. Log-structured
file systems can easily align write I/O requests to a bigger logical block size, but
in-place-update file systems can see increased write amplification because a bigger
portion of data has to be flushed for a small modification. However, the FTL can use
delta encoding and smart logic for compacting several logical blocks into one NAND
flash page. And, by the way, a NAND flash page is usually bigger than 4KB.

Thanks,
Slava.



* Re: [LSF/MM/BPF TOPIC] Large block for I/O
  2023-12-22  8:23       ` Viacheslav Dubeyko
@ 2023-12-22 12:29         ` Hannes Reinecke
  2023-12-22 13:29           ` Matthew Wilcox
  2023-12-22 15:10         ` Keith Busch
  1 sibling, 1 reply; 26+ messages in thread
From: Hannes Reinecke @ 2023-12-22 12:29 UTC (permalink / raw)
  To: Viacheslav Dubeyko, Bart Van Assche, Matthew Wilcox
  Cc: lsf-pc, linux-mm, linux-block, linux-scsi, linux-nvme

On 12/22/23 09:23, Viacheslav Dubeyko wrote:
> 
> 
>> On Dec 21, 2023, at 11:33 PM, Bart Van Assche <bvanassche@acm.org> wrote:
>>
> 
> <skipped>
> 
>>> .
>>
>> Hi Hannes,
>>
>> I'm interested in this topic. But I'm wondering whether the disadvantages of
>> large blocks will be covered? Some NAND storage vendors are less than
>> enthusiastic about increasing the logical block size beyond 4 KiB because it
>> increases the size of many writes to the device and hence increases write
>> amplification.
>>
> 
> I am also interested in this discussion. Every SSD manufacturer carefully hides the
> details of its architecture and FTL behavior. I believe that switching to a bigger
> logical block size (like 8KB, 16KB, etc.) could even be better for the SSD's internal
> mapping scheme and erase block management. I assume that it could require significant
> reworking of the firmware and, potentially, of the ASIC logic, and this could be the
> main pain for SSD manufacturers. Frankly speaking, I don't see a direct relation between
> increasing the logical block size and increasing write amplification. If you have a 16KB
> logical block size on the SSD side and the file system continues to use a 4KB logical
> block size, then, yes, I can see the problem. But if the file system manages the space
> in 16KB logical blocks and carefully issues I/O requests of the proper size, then
> everything should be good. Again, the FTL is simply trying to write logical blocks into
> erase blocks. If we have, for example, an 8MB erase block, then mapping and writing
> 16KB logical blocks looks more beneficial than mapping and writing 4KB logical blocks.
> 
> So, I see more trouble on the file system side in supporting a bigger logical block
> size. For example, we discussed 8KB folio size support recently. Matthew already shared
> a patch for supporting 8KB folios, but everything should be carefully tested. Also, I
> ran into an issue with the readahead logic: if I format my file system volume with a
> 32KB logical block size, the readahead logic hands me 16KB folios, which was slightly
> surprising. So, I assume we can find a lot of potential issues on the file system side
> with a bigger logical block size, in terms of the efficiency of metadata and user data
> operations. Also, heavily loaded systems could have fragmented memory, which could make
> memory allocation trickier; it may not be easy to allocate one big folio. Log-structured
> file systems can easily align write I/O requests to a bigger logical block size, but
> in-place-update file systems can see increased write amplification because a bigger
> portion of data has to be flushed for a small modification. However, the FTL can use
> delta encoding and smart logic for compacting several logical blocks into one NAND
> flash page. And, by the way, a NAND flash page is usually bigger than 4KB.
> 
And that is actually a very valid point; memory fragmentation will 
become an issue with larger block sizes.

Theoretically it should be quite easily solved; just switch the memory 
subsystem to use the largest block size in the system, and run every 
smaller memory allocation via SLUB (or whatever the allocator-of-the-day
currently is :-). Then trivially the system will never be fragmented,
and I/O can always use large folios.

However, that means doing away with alloc_page(), which is still in
widespread use throughout the kernel. I would actually be in favour of it,
but it might be that mm people have a different view.

Matthew, worth a new topic?
Handling memory fragmentation on large block I/O systems?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich



* Re: [LSF/MM/BPF TOPIC] Large block for I/O
  2023-12-22 12:29         ` Hannes Reinecke
@ 2023-12-22 13:29           ` Matthew Wilcox
  0 siblings, 0 replies; 26+ messages in thread
From: Matthew Wilcox @ 2023-12-22 13:29 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Viacheslav Dubeyko, Bart Van Assche, lsf-pc, linux-mm,
	linux-block, linux-scsi, linux-nvme

On Fri, Dec 22, 2023 at 01:29:18PM +0100, Hannes Reinecke wrote:
> And that is actually a very valid point; memory fragmentation will become an
> issue with larger block sizes.
> 
> Theoretically it should be quite easily solved; just switch the memory
> subsystem to use the largest block size in the system, and run every smaller
> memory allocation via SLUB (or whatever the allocator-of-the-day
> currently is :-). Then trivially the system will never be fragmented,
> and I/O can always use large folios.
> 
> However, that means doing away with alloc_page(), which is still in
> widespread use throughout the kernel. I would actually be in favour of it,
> but it might be that mm people have a different view.
> 
> Matthew, worth a new topic?
> Handling memory fragmentation on large block I/O systems?

I think if we're going to do that as a topic (and I'm not opposed!),
we need data.  Various workloads, various block sizes, etc.  Right now
people discuss this topic with "feelings" and "intuition" and I think
we need more than vibes to have a productive discussion.

My laptop (rebooted last night due to an unfortunate upgrade that left
anything accessing the sound device hanging ...):

MemTotal:       16006344 kB
MemFree:         2353108 kB
Cached:          7957552 kB
AnonPages:       4271088 kB
Slab:             654896 kB

so ~50% of my 16GB of memory is in the page cache and ~25% is anon memory.
If the page cache is all in 16kB chunks and we need to allocate order-2
folios in order to read from a file, we can find it easily by reclaiming
other order-2 folios from the page cache.  We don't need to resort to
heroics like eliminating use of alloc_page().
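
(For reference, a minimal sketch of that arithmetic, reading the same
counters from /proc/meminfo as quoted above:)

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[256], key[64];
    unsigned long total = 0, cached = 0, anon = 0, val;

    if (!f)
        return 1;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "%63[^:]: %lu", key, &val) != 2)
            continue;
        if (!strcmp(key, "MemTotal"))
            total = val;
        else if (!strcmp(key, "Cached"))
            cached = val;
        else if (!strcmp(key, "AnonPages"))
            anon = val;
    }
    fclose(f);

    if (total) {
        printf("page cache: %.0f%%\n", 100.0 * cached / total);
        printf("anonymous:  %.0f%%\n", 100.0 * anon / total);
    }
    return 0;
}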

We should eliminate use of alloc_page() across most of the kernel, but
that's a different topic and one that has not much relevance to LSF/MM
since it's drivers that need to change, not the MM ;-)

Now, other people "feel" differently.  And that's cool, but we're not
going to have a productive discussion without data that shows whose
feelings represent reality and for which kinds of workloads.


* Re: [LSF/MM/BPF TOPIC] Large block for I/O
  2023-12-22  8:23       ` Viacheslav Dubeyko
  2023-12-22 12:29         ` Hannes Reinecke
@ 2023-12-22 15:10         ` Keith Busch
  2023-12-22 16:06           ` Matthew Wilcox
  2023-12-25  8:12           ` Viacheslav Dubeyko
  1 sibling, 2 replies; 26+ messages in thread
From: Keith Busch @ 2023-12-22 15:10 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: Bart Van Assche, Hannes Reinecke, lsf-pc, linux-mm, linux-block,
	linux-scsi, linux-nvme

On Fri, Dec 22, 2023 at 11:23:26AM +0300, Viacheslav Dubeyko wrote:
> > On Dec 21, 2023, at 11:33 PM, Bart Van Assche <bvanassche@acm.org> wrote:
> > I'm interested in this topic. But I'm wondering whether the disadvantages of
> > large blocks will be covered? Some NAND storage vendors are less than
> > enthusiastic about increasing the logical block size beyond 4 KiB because it
> > increases the size of many writes to the device and hence increases write
> > amplification.
> > 
> 
> I am also interested in this discussion. Every SSD manufacturer carefully hides the
> details of its architecture and FTL behavior. I believe that switching to a bigger
> logical block size (like 8KB, 16KB, etc.) could even be better for the SSD's internal
> mapping scheme and erase block management. I assume that it could require significant
> reworking of the firmware and, potentially, of the ASIC logic, and this could be the
> main pain for SSD manufacturers. Frankly speaking, I don't see a direct relation between
> increasing the logical block size and increasing write amplification. If you have a 16KB
> logical block size on the SSD side and the file system continues to use a 4KB logical
> block size, then, yes, I can see the problem. But if the file system manages the space
> in 16KB logical blocks and carefully issues I/O requests of the proper size, then
> everything should be good. Again, the FTL is simply trying to write logical blocks into
> erase blocks. If we have, for example, an 8MB erase block, then mapping and writing
> 16KB logical blocks looks more beneficial than mapping and writing 4KB logical blocks.

If the host really wants to write in small granularities, then larger
block sizes just shift the write amplification from the device to the
host, which seems worse than letting the device deal with it.

I've done some early profiling on my fleet and there are definitely
applications that overwhelmingly prefer larger writes. Those should be
great candidates to use these kinds of logical block formats. It's
already flash-friendly, but aligning filesystems and memory management
to the same granularity is a nice plus.

Other applications, though, still need 4k writes. Turning those into RMWs
on the host to modify 4k in the middle of a 16k block is obviously a bad
fit.

Anyway, your mileage may vary. This example BPF program provides an okay
starting point for examining disk usage to see if large logical block
sizes are a good fit for your application:

  https://github.com/iovisor/bpftrace/blob/master/tools/bitesize.bt
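
(For illustration only, a toy user-space sketch of the idea the tool
implements, power-of-two bucketing of write sizes; the real tool
attaches to block-layer tracepoints via bpftrace, and the sizes below
are made up:)

#include <stdio.h>

int main(void)
{
    /* Hypothetical write sizes in bytes, as tracing might report them. */
    unsigned int sizes[] = { 4096, 4096, 16384, 16384, 16384, 8192, 512 };
    unsigned long buckets[32] = { 0 };
    unsigned int i, b;

    for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        /* Find the power-of-two bucket [2^b, 2^(b+1)) containing the size. */
        for (b = 0; b < 31 && (1u << (b + 1)) <= sizes[i]; b++)
            ;
        buckets[b]++;
    }

    for (b = 0; b < 31; b++)
        if (buckets[b])
            printf("[%u, %u) bytes: %lu write(s)\n",
                   1u << b, 1u << (b + 1), buckets[b]);
    return 0;
}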
 
> So, I see more trouble on the file system side in supporting a bigger
> logical block size. For example, we discussed 8KB folio size support
> recently. Matthew already shared a patch for supporting 8KB folios, but
> everything should be carefully tested. Also, I ran into an issue with
> the readahead logic: if I format my file system volume with a 32KB
> logical block size, the readahead logic hands me 16KB folios, which was
> slightly surprising. So, I assume we can find a lot of potential issues
> on the file system side with a bigger logical block size, in terms of
> the efficiency of metadata and user data operations. Also, heavily
> loaded systems could have fragmented memory, which could make memory
> allocation trickier; it may not be easy to allocate one big folio.
> Log-structured file systems can easily align write I/O requests to a
> bigger logical block size, but in-place-update file systems can see
> increased write amplification because a bigger portion of data has to
> be flushed for a small modification. However, the FTL can use delta
> encoding and smart logic for compacting several logical blocks into one
> NAND flash page. And, by the way, a NAND flash page is usually bigger
> than 4KB.


* Re: [LSF/MM/BPF TOPIC] Large block for I/O
  2023-12-22 15:10         ` Keith Busch
@ 2023-12-22 16:06           ` Matthew Wilcox
  2023-12-25  8:55             ` Viacheslav Dubeyko
  2023-12-25  8:12           ` Viacheslav Dubeyko
  1 sibling, 1 reply; 26+ messages in thread
From: Matthew Wilcox @ 2023-12-22 16:06 UTC (permalink / raw)
  To: Keith Busch
  Cc: Viacheslav Dubeyko, Bart Van Assche, Hannes Reinecke, lsf-pc,
	linux-mm, linux-block, linux-scsi, linux-nvme

On Fri, Dec 22, 2023 at 08:10:54AM -0700, Keith Busch wrote:
> If the host really wants to write in small granularities, then larger
> block sizes just shift the write amplification from the device to the
> host, which seems worse than letting the device deal with it.

Maybe?  I'm never sure about that.  See, if the drive is actually
managing the flash in 16kB chunks internally, then the drive has to do a
RMW which is increased latency over the host just doing a 16kB write,
which can go straight to flash.  Assuming the host has the whole 16kB in
memory (likely?)  Of course, if you're PCIe bandwidth limited, then a
4kB write looks more attractive, but generally I think drives tend to
be IOPS limited not bandwidth limited today?
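
As a back-of-the-envelope sketch of that trade-off (the IOPS ceiling is
an assumption, not a claim about any particular drive):

#include <stdio.h>

int main(void)
{
    /* Hypothetical IOPS-limited drive: same IOPS ceiling, two write sizes. */
    const double iops = 500000.0;                 /* assumed ceiling */
    const double sizes[] = { 4096.0, 16384.0 };   /* bytes per write */
    unsigned int i;

    for (i = 0; i < 2; i++)
        printf("%5.0f B writes at %.0f IOPS -> %.1f GB/s\n",
               sizes[i], iops, iops * sizes[i] / 1e9);
    return 0;
}

At a fixed IOPS ceiling, 16kB writes move four times the data of 4kB
writes, so the larger block only loses out once link bandwidth, rather
than IOPS, becomes the limit.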



* Re: [LSF/MM/BPF TOPIC] Large block for I/O
  2023-12-22 15:10         ` Keith Busch
  2023-12-22 16:06           ` Matthew Wilcox
@ 2023-12-25  8:12           ` Viacheslav Dubeyko
  1 sibling, 0 replies; 26+ messages in thread
From: Viacheslav Dubeyko @ 2023-12-25  8:12 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bart Van Assche, Hannes Reinecke, lsf-pc, linux-mm, linux-block,
	linux-scsi, linux-nvme



> On Dec 22, 2023, at 6:10 PM, Keith Busch <kbusch@kernel.org> wrote:
> 
> 

<skipped>

> 
> Other applications, though, still need 4k writes. Turning those into RMWs
> on the host to modify 4k in the middle of a 16k block is obviously a bad
> fit.

So, if an application doesn't work with the raw device directly or doesn't use O_DIRECT,
then we always have the file system's page cache in the middle. A 4K write then dirties
the whole 16K logical block from the file system's point of view, and the file system
will need to flush the whole 16K logical block even if the 4K modification was only in
the middle of it. Potentially, that sounds like increased write amplification. However,
metadata usually requires smaller granularity (like 4K), and metadata is a frequently
updated type of data, so there is a significant probability that, on average, a 16K
logical block of metadata will be evenly updated by 4K writes before the flush operation.
If we have cold user data, then the logical block size doesn't matter because the write
can be aligned. I assume that frequently updated user data is localized in some area(s)
of a file, which means a 16K logical block could gather several frequently updated 4K
areas. Theoretically, it is possible to imagine a really nasty even distribution of 4K
updates through the whole file with holes in between, but that looks like stress testing
or benchmarking rather than a real-life use case or workload.

Let's imagine that an application writes directly to the raw device with 4K I/O
operations. If the block device exposes a 16K physical sector size, can we still write
with 4K I/O operations? From another point of view, if I know that my application
updates in 4K units, then what's the point of using a device with a 16K physical sector
size? I hope we will have the opportunity to choose between devices that support 4K and
16K physical sector sizes. But, technically speaking, a storage device usually receives
multiple I/O requests at the same time. Even if it receives 4K updates for different
LBAs, it is possible to combine several 4K updates into one 16K NAND flash page. The
question here is how to map the updates onto LBAs efficiently, because the FTL's main
responsibility is mapping (LBAs into erase blocks, for example).

Thanks,
Slava.



* Re: [LSF/MM/BPF TOPIC] Large block for I/O
  2023-12-22 16:06           ` Matthew Wilcox
@ 2023-12-25  8:55             ` Viacheslav Dubeyko
  0 siblings, 0 replies; 26+ messages in thread
From: Viacheslav Dubeyko @ 2023-12-25  8:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Keith Busch, Bart Van Assche, Hannes Reinecke, lsf-pc, linux-mm,
	linux-block, linux-scsi, linux-nvme



> On Dec 22, 2023, at 7:06 PM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Fri, Dec 22, 2023 at 08:10:54AM -0700, Keith Busch wrote:
>> If the host really wants to write in small granularities, then larger
>> block sizes just shift the write amplification from the device to the
>> host, which seems worse than letting the device deal with it.
> 
> Maybe?  I'm never sure about that.  See, if the drive is actually
> managing the flash in 16kB chunks internally, then the drive has to do a
> RMW which is increased latency over the host just doing a 16kB write,
> which can go straight to flash.  Assuming the host has the whole 16kB in
> memory (likely?)  Of course, if you're PCIe bandwidth limited, then a
> 4kB write looks more attractive, but generally I think drives tend to
> be IOPS limited not bandwidth limited today?
> 

Fundamentally, if a storage device exposes a 16K physical sector size, then
I am not sure that we can write with 4K I/O requests. It means that we would
have to read the 16K LBA into the page cache or the application's buffer before
any write operation. So, I see a potential RMW inside the storage device only
if the device is capable of handling 4K I/O requests even though the physical
sector is 16K. But is that a real-life use case?

I am not sure about the attractiveness of 4K write operations. Usually, a file
system provides a way to configure an internal logical block size and metadata
granularities, so it is possible to align the internal metadata and user data
granularities to 16K, for example. And if we are talking about metadata
structures (for example, the inode table, block mapping, etc.), then that is
frequently updated data, so a 16K block will most probably contain several
updated 4K pieces. As a result, we have to flush all of this updated metadata
anyway, despite any PCIe bandwidth limitation (if we have one). Also, I assume
that sending one 16K I/O request could be more beneficial than several 4K I/O
requests. Of course, real life is more complicated.

Thanks,
Slava.



* Re: [LSF/MM/BPF TOPIC] Large block for I/O
  2023-12-22  5:37         ` Christoph Hellwig
@ 2024-01-08 19:30           ` Bart Van Assche
  2024-01-08 19:35             ` Matthew Wilcox
  0 siblings, 1 reply; 26+ messages in thread
From: Bart Van Assche @ 2024-01-08 19:30 UTC (permalink / raw)
  To: Christoph Hellwig, Matthew Wilcox
  Cc: Hannes Reinecke, lsf-pc, linux-mm, linux-block, linux-scsi, linux-nvme

On 12/21/23 21:37, Christoph Hellwig wrote:
> On Fri, Dec 22, 2023 at 05:13:43AM +0000, Matthew Wilcox wrote:
>> It clearly solves a problem (and the one I think it's solving is the
>> size of the FTL map).  But I can't see why we should stop working on it,
>> just because not all drive manufacturers want to support it.
> 
> I don't think it is drive vendors.  It is the SSD divisions which
> all pretty much love it (for certain use cases) vs the UFS/eMMC
> divisions which often tend to be fearful and less knowledgeable (to
> say it nicely) no matter what vendor you're talking to.

Hi Christoph,

If there is a significant number of 4 KiB writes in a workload (e.g.
filesystem metadata writes), and the logical block size is increased from
4 KiB to 16 KiB, this will increase write amplification no matter how the
SSD storage controller has been designed, won't it? Is there perhaps
something that I'm misunderstanding?

Thanks,

Bart.




* Re: [LSF/MM/BPF TOPIC] Large block for I/O
  2024-01-08 19:30           ` Bart Van Assche
@ 2024-01-08 19:35             ` Matthew Wilcox
  2024-02-22 18:45               ` Luis Chamberlain
  0 siblings, 1 reply; 26+ messages in thread
From: Matthew Wilcox @ 2024-01-08 19:35 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Hannes Reinecke, lsf-pc, linux-mm,
	linux-block, linux-scsi, linux-nvme

On Mon, Jan 08, 2024 at 11:30:10AM -0800, Bart Van Assche wrote:
> On 12/21/23 21:37, Christoph Hellwig wrote:
> > On Fri, Dec 22, 2023 at 05:13:43AM +0000, Matthew Wilcox wrote:
> > > It clearly solves a problem (and the one I think it's solving is the
> > > size of the FTL map).  But I can't see why we should stop working on it,
> > > just because not all drive manufacturers want to support it.
> > 
> > I don't think it is drive vendors.  It is the SSD divisions which
> > all pretty much love it (for certain use cases) vs the UFS/eMMC
> > divisions which often tend to be fearful and less knowledgeable (to
> > say it nicely) no matter what vendor you're talking to.
> 
> Hi Christoph,
> 
> If there is a significant number of 4 KiB writes in a workload (e.g.
> filesystem metadata writes), and the logical block size is increased from
> 4 KiB to 16 KiB, this will increase write amplification no matter how the
> SSD storage controller has been designed, won't it? Is there perhaps
> something that I'm misunderstanding?

You're misunderstanding that it's the _drive_ which gets to decide the
logical block size.  Filesystems literally can't do 4kB writes to these
drives; you can't do a write smaller than a block.  If your clients
don't think it's a good tradeoff for them, they won't tell Linux that
the minimum IO size is 16kB.

Some workloads are better with a 4kB block size, no doubt.  Others are
better with a 512 byte block size.  That doesn't prevent vendors from
offering 4kB LBA size drives.
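
For completeness, this is how user space sees what the drive decided,
via the standard BLKSSZGET/BLKPBSZGET ioctls (the device path is only
an example):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/nvme0n1";  /* example */
    int lbs = 0;
    unsigned int pbs = 0;
    int fd = open(dev, O_RDONLY);

    if (fd < 0) {
        perror(dev);
        return 1;
    }
    ioctl(fd, BLKSSZGET, &lbs);   /* logical block (sector) size */
    ioctl(fd, BLKPBSZGET, &pbs);  /* physical block size */
    printf("%s: logical %d bytes, physical %u bytes\n", dev, lbs, pbs);
    close(fd);
    return 0;
}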


* LSF/MM/BPF: 2024: Call for Proposals [Reminder]
  2023-12-20 10:01 ` LSF/MM/BPF: 2024: Call for Proposals Daniel Borkmann
  2023-12-20 15:03   ` [LSF/MM/BPF TOPIC] Large block for I/O Hannes Reinecke
@ 2024-01-17 13:37   ` Daniel Borkmann
  2024-02-14 13:03     ` LSF/MM/BPF: 2024: Call for Proposals [Final Reminder] Daniel Borkmann
  1 sibling, 1 reply; 26+ messages in thread
From: Daniel Borkmann @ 2024-01-17 13:37 UTC (permalink / raw)
  To: lsf-pc
  Cc: linux-fsdevel, linux-mm, linux-block, linux-scsi, linux-nvme,
	bpf, netdev, linux-ide, linux-kernel

[The call for proposals text and program committee list are repeated
verbatim from the original announcement at the top of this thread.]


* LSF/MM/BPF: 2024: Call for Proposals [Final Reminder]
  2024-01-17 13:37   ` LSF/MM/BPF: 2024: Call for Proposals [Reminder] Daniel Borkmann
@ 2024-02-14 13:03     ` Daniel Borkmann
  0 siblings, 0 replies; 26+ messages in thread
From: Daniel Borkmann @ 2024-02-14 13:03 UTC (permalink / raw)
  To: lsf-pc
  Cc: linux-scsi, linux-ide, netdev, linux-kernel, linux-nvme,
	linux-block, linux-mm, linux-fsdevel, bpf

[The call for proposals text and program committee list are repeated
verbatim from the original announcement at the top of this thread.]



* Re: [LSF/MM/BPF TOPIC] Large block for I/O
  2024-01-08 19:35             ` Matthew Wilcox
@ 2024-02-22 18:45               ` Luis Chamberlain
  2024-02-25 23:09                 ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Luis Chamberlain @ 2024-02-22 18:45 UTC (permalink / raw)
  To: Matthew Wilcox, Daniel Gomez, Pankaj Raghav, Dave Chinner, Jan Kara
  Cc: Bart Van Assche, Christoph Hellwig, Hannes Reinecke, lsf-pc,
	linux-mm, linux-block, linux-scsi, linux-nvme

On Mon, Jan 08, 2024 at 07:35:17PM +0000, Matthew Wilcox wrote:
> On Mon, Jan 08, 2024 at 11:30:10AM -0800, Bart Van Assche wrote:
> > On 12/21/23 21:37, Christoph Hellwig wrote:
> > > On Fri, Dec 22, 2023 at 05:13:43AM +0000, Matthew Wilcox wrote:
> > > > It clearly solves a problem (and the one I think it's solving is the
> > > > size of the FTL map).  But I can't see why we should stop working on it,
> > > > just because not all drive manufacturers want to support it.
> > > 
> > > I don't think it is drive vendors.  It is the SSD divisions which
> > > all pretty much love it (for certain use cases) vs the UFS/eMMC
> > > divisions which often tend to be fearful and less knowledgeable (to
> > > say it nicely) no matter what vendor you're talking to.
> > 
> > Hi Christoph,
> > 
> > If there is a significant number of 4 KiB writes in a workload (e.g.
> > filesystem metadata writes), and the logical block size is increased from
> > 4 KiB to 16 KiB, this will increase write amplification no matter how the
> > SSD storage controller has been designed, won't it? Is there perhaps
> > something that I'm misunderstanding?
> 
> You're misunderstanding that it's the _drive_ which gets to decide the
> logical block size. Filesystems literally can't do 4kB writes to these
> drives; you can't do a write smaller than a block.  If your clients
> don't think it's a good tradeoff for them, they won't tell Linux that
> the minimum IO size is 16kB.

Yes, but it's perhaps good to review how flexible this might be or not.
I can at least mention what I know of for NVMe. Getting the lay of the
land for other storage media would be good.

Some of the large capacity NVMe drives have NPWG as 16k; that just means
the Indirection Unit, i.e. the mapping table granularity, is 16k, so the
drive is hinting *we prefer 16k*. You can still do 4k writes; it just
means that on these drives a 4k write will be an RMW.
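
For what it's worth, the hint and the resulting limits are visible in
sysfs; a minimal sketch reading them (the device name is an example,
and exactly how NPWG maps onto these limits is up to the driver):

#include <stdio.h>

static long read_limit(const char *dev, const char *attr)
{
    char path[256];
    long val = -1;
    FILE *f;

    snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, attr);
    f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%ld", &val) != 1)
            val = -1;
        fclose(f);
    }
    return val;
}

int main(void)
{
    const char *dev = "nvme0n1";  /* example device */

    printf("logical_block_size:  %ld\n", read_limit(dev, "logical_block_size"));
    printf("physical_block_size: %ld\n", read_limit(dev, "physical_block_size"));
    printf("minimum_io_size:     %ld\n", read_limit(dev, "minimum_io_size"));
    printf("optimal_io_size:     %ld\n", read_limit(dev, "optimal_io_size"));
    return 0;
}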

Users who *want* to help avoid RMWs on these drives and want to increase the
writes to be at least 16k can enable a 16k or larger block size so as to
align the writes. The experimentation we have done using Daniel Gomez's
eBPF blkalgn tool [0] revealed (as discussed at last year's Plumbers) that
there were still some 4k writes; this was in turn determined to be due
to XFS's buffer cache usage for metadata. Dave recently posted patches to allow
the use of large folios in the xfs buffer cache [1], and Daniel has started making
further observations on this, which he'll be revealing soon.

[0] https://github.com/dagmcr/bcc/tree/blkalgn-dump
[1] https://lore.kernel.org/all/20240118222216.4131379-1-david@fromorbit.com/

For large capacity NVMe drives with large atomics (NAWUPF), the
nvme block driver will allow the physical block size to be 16k too,
thus allowing the sector size to be set to 16k when creating the
filesystem; that would *optionally* allow users to force the
filesystem to never issue *any* 4k writes to the device. Note
that there are two ways to use a 16k sector size for NVMe today:
one is if your drive supports a 16k LBA format, and the other is
with these two parameters set to 16k. The latter allows you to
stick with a 512 byte or 4k LBA format and still use a 16k sector
size, which lets you remain backward compatible.

Jan Kara's patches "block: Add config option to not allow writing to
mounted devices" [2] should allow us to remove the set_blocksize() call
in xfs_setsize_buftarg() since XFS does not use the block device cache
at all, and his patches ensure that once a filesystem is mounted userspace
won't muck with the block device directly.

As for the impact of this on 4k writes: if you create the filesystem
with a 16k sector size then we're strict, and it means a minimum of 16k
is needed. It is no different from what is done for 4k today, where the
logical block size is 512 bytes and we use a 4k sector size because the
physical block size is 4k. If using buffered IO then we can leverage
the page cache for modifications. Either way, you should do your WAF
homework too. Even if you *do* have 4k workloads, under the hood you
may see that, as a matter of fact, the number of IOs which are 4k is
very likely small. Insofar as WAF is concerned, the *IO volume* is what
matters. Luca Bert has a great write-up on his team's findings when
evaluating some real-world workloads' WAF estimates in terms of IO
volume [3].

[2] https://lkml.kernel.org/r/20231101173542.23597-1-jack@suse.cz
[3] https://www.micron.com/about/blog/2023/october/real-life-workloads-allow-more-efficient-data-granularity-and-enable-very-large-ssd-capacities
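
A minimal sketch of the WAF arithmetic over IO volume (the byte counts
are placeholders; where the host- and media-written totals come from
depends on your drive's telemetry):

#include <stdio.h>

int main(void)
{
    /*
     * Placeholder totals over a measurement window: host_bytes from
     * host-side accounting, media_bytes from the drive's media-write
     * telemetry. Both values here are assumed, not measured.
     */
    double host_bytes = 800e9;
    double media_bytes = 1000e9;

    /* WAF is a ratio of IO volume, not of IO counts. */
    printf("WAF = %.2f\n", media_bytes / host_bytes);
    return 0;
}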

We were not aware of public open source tools to do what they did,
so we worked on a tool that allows just that. You can measure your
workload WAF using Daniel Gomez's WAF tool for NVMe [4] and decide if
the tradeoffs are acceptable. It would be good for us to automate
generic workloads, slap it on kdevops, and compute WAF, for instance.

[4] https://github.com/dagmcr/bcc/tree/nvmeiuwaf

> Some workloads are better with a 4kB block size, no doubt.  Others are
> better with a 512 byte block size.  That doesn't prevent vendors from
> offering 4kB LBA size drives.

Indeed, using large block sizes is by no means meant for all workloads. But
it's a good time to also remind folks that larger IOs tend to just be
good for flash storage in general too. So if your WAF measurements check
out, using large block sizes is something to evaluate.

 Luis


* Re: [LSF/MM/BPF TOPIC] Large block for I/O
  2023-12-20 15:03   ` [LSF/MM/BPF TOPIC] Large block for I/O Hannes Reinecke
  2023-12-21 20:33     ` Bart Van Assche
@ 2024-02-23 16:41     ` Pankaj Raghav (Samsung)
  1 sibling, 0 replies; 26+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-23 16:41 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: lsf-pc, linux-mm, linux-block, linux-scsi, linux-nvme, mcgrof, p.raghav

On Wed, Dec 20, 2023 at 04:03:43PM +0100, Hannes Reinecke wrote:
> Hi all,
> 
> I would like to discuss
> 
> Large blocks for I/O
> 
> Since the presentation last year there have been quite a few developments
> and improvements in some areas, but at the same time a lack of progress
> in other areas.
> In this presentation/discussion I would like to highlight the current
> state of affairs, existing pain points, and future directions of
> development.
> It might be an idea to co-locate it with the MM folks as we do have
> quite some overlap with page-cache improvements and hugepage handling.

I am interested in attending this session. As we are getting closer to
having LBS in XFS[1], we could then get LBS support for block
devices for free if we use iomap to interact with the block cache
(!CONFIG_BUFFER_HEAD).

So one of the focus points for this discussion could be adding LBS
support to the buffer_head path for block devices, and the blockers (if any).

Another important discussion point is testing. xfstests helped iron out
bugs in page cache and iomap while adding the LBS support for XFS. If we
add support to buffer_heads, then how are we going to stress test the changes?
I doubt just blktests would be enough to test the changes in page cache
and buffer_heads.

[1] https://lore.kernel.org/linux-xfs/20240213093713.1753368-1-kernel@pankajraghav.com/

--
Pankaj


* Re: [LSF/MM/BPF TOPIC] Large block for I/O
  2024-02-22 18:45               ` Luis Chamberlain
@ 2024-02-25 23:09                 ` Dave Chinner
  2024-02-26 15:25                   ` Luis Chamberlain
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2024-02-25 23:09 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Matthew Wilcox, Daniel Gomez, Pankaj Raghav, Jan Kara,
	Bart Van Assche, Christoph Hellwig, Hannes Reinecke, lsf-pc,
	linux-mm, linux-block, linux-scsi, linux-nvme

On Thu, Feb 22, 2024 at 10:45:25AM -0800, Luis Chamberlain wrote:
> On Mon, Jan 08, 2024 at 07:35:17PM +0000, Matthew Wilcox wrote:
> > On Mon, Jan 08, 2024 at 11:30:10AM -0800, Bart Van Assche wrote:
> > > On 12/21/23 21:37, Christoph Hellwig wrote:
> > > > On Fri, Dec 22, 2023 at 05:13:43AM +0000, Matthew Wilcox wrote:
> > > > > It clearly solves a problem (and the one I think it's solving is the
> > > > > size of the FTL map).  But I can't see why we should stop working on it,
> > > > > just because not all drive manufacturers want to support it.
> > > > 
> > > > I don't think it is drive vendors.  It is the SSD divisions which
> > > > all pretty much love it (for certain use cases) vs the UFS/eMMC
> > > > divisions which tend to often be fearful and less knowledgeable (to
> > > > say it nicely) no matter what vendor you're talking to.
> > > 
> > > Hi Christoph,
> > > 
> > > If there is a significant number of 4 KiB writes in a workload (e.g.
> > > filesystem metadata writes), and the logical block size is increased from
> > > 4 KiB to 16 KiB, this will increase write amplification no matter how the
> > > SSD storage controller has been designed, isn't it? Is there perhaps
> > > something that I'm misunderstanding?
> > 
> > You're misunderstanding that it's the _drive_ which gets to decide the
> > logical block size. Filesystems literally can't do 4kB writes to these
> > drives; you can't do a write smaller than a block.  If your clients
> > don't think it's a good tradeoff for them, they won't tell Linux that
> > the minimum IO size is 16kB.
> 
> Yes, but it's perhaps good to review how flexible this might or might
> not be. I can at least mention what I know of for NVMe. Getting a lay
> of the land of this for other storage media would be good.
> 
> Some of the large capacity NVMe drives have NPWG as 16k; that just means
> the Indirection Unit, the mapping table, is 16k, so the drive is hinting
> *we prefer 16k*, but you can still do 4k writes; it just means that on
> all these drives a 4k write will be an RMW.

That's just a 4kb logical sector, 16kB physical sector block device,
yes?

Maybe I'm missing something, but we already handle cases like that
just fine thanks to all the work that went into supporting 512e
devices...

> Users who *want* to help avoid RMWs on these drives and want to increase
> the writes to be at least 16k can enable a 16k or larger block size so as
> to align the writes. The experimentation we have done using Daniel Gomez's
> eBPF blkalgn tool [0] revealed (as discussed at last year's Plumbers) that
> there were still some 4k writes; this was in turn determined to be due
> to XFS's buffer cache usage for metadata.

As I've explained several times, XFS AG headers are sector sized
metadata. If you are exposing a 4kB logical sector size on a 16kB
physical sector device, this is what you'll get. It's been that way
with 512e devices for a long time, yes?

Also, direct IO will allow sector sized user data IOs, too, so it's
not just XFS metadata that will be issuing 4kB IO in this case...

> Dave recently posted patches to allow
> the use of large folios in the xfs buffer cache [1],

This has nothing to do with supporting large block sizes - it's
purely an internal optimisation to reduce the amount of vmap
(vmalloc) work we have to do for buffers that are larger than
PAGE_SIZE on 4kB block size filesystems.

> For large capacity NVMe drives with large atomics (NAWUPF), the
> nvme block driver will allow for the physical block size to be 16k too,
> thus allowing the sector size to be set to 16k when creating the
> filesystem; that would *optionally* allow users to force the
> filesystem so that no writes to the device are 4k.

Just present it as a 16kB logical/physical sector block device. Then
userspace and the filesystem will magically just do the right thing.

We've already solved these problems, yes?

> Note
> then that there are two ways to use a sector size of 16k for NVMe
> today: one is if your drive supports a 16k LBA format, and
> another is with these two parameters set to 16k. The latter allows you
> to stick with a 512 byte or 4k LBA format and still use a 16k sector size.
> That allows you to remain backward compatible.

Yes, that's an emulated small logical sector size block device.
We've been supporting this for years - how are these NVMe drives in
any way different? Configure the drive this way, it presents as a
512e or 4096e device, not a 16kB sector size device, yes?

> Jan Kara's patches "block: Add config option to not allow writing to
> mounted devices" [2] should allow us to remove the set_blocksize() call
> in xfs_setsize_buftarg() since XFS does not use the block device cache
> at all, and his patches ensure that once a filesystem is mounted, userspace
> won't muck with the block device directly.

That patch is completely irrelevant to how the block device presents
sector sizes to userspace and the filesystem. It's also completely
irrelevant to large block size support in filesystems. Why do you
think it is relevant at all?

<snip irrelevant WAF stuff that we already know all about>

I'm not sure exactly what is being argued about here, but if the
large sector size support requires filesystem utilities to treat
4096e NVMe devices differently to existing 512e devices then the
large sector size support stuff has gone completely off the rails.

We already have all the mechanisms needed for optimising layouts for
large physical sector sizes w/ small emulated sector sizes and we
have widespread userspace support for that. If this new large block
device sector stuff doesn't work the same way, then you need to go
back to the drawing board and make it work transparently with all
the existing userspace infrastructure....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Large block for I/O
  2024-02-25 23:09                 ` Dave Chinner
@ 2024-02-26 15:25                   ` Luis Chamberlain
  2024-03-07  1:59                     ` Luis Chamberlain
  0 siblings, 1 reply; 26+ messages in thread
From: Luis Chamberlain @ 2024-02-26 15:25 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Daniel Gomez, Pankaj Raghav, Jan Kara,
	Bart Van Assche, Christoph Hellwig, Hannes Reinecke, lsf-pc,
	linux-mm, linux-block, linux-scsi, linux-nvme

On Mon, Feb 26, 2024 at 10:09:08AM +1100, Dave Chinner wrote:
> On Thu, Feb 22, 2024 at 10:45:25AM -0800, Luis Chamberlain wrote:
> > On Mon, Jan 08, 2024 at 07:35:17PM +0000, Matthew Wilcox wrote:
> > > On Mon, Jan 08, 2024 at 11:30:10AM -0800, Bart Van Assche wrote:
> > > > On 12/21/23 21:37, Christoph Hellwig wrote:
> > > > > On Fri, Dec 22, 2023 at 05:13:43AM +0000, Matthew Wilcox wrote:
> > > > > > It clearly solves a problem (and the one I think it's solving is the
> > > > > > size of the FTL map).  But I can't see why we should stop working on it,
> > > > > > just because not all drive manufacturers want to support it.
> > > > > 
> > > > > I don't think it is drive vendors.  It is the SSD divisions which
> > > > > all pretty much love it (for certain use cases) vs the UFS/eMMC
> > > > > divisions which tend to often be fearful and less knowledgeable (to
> > > > > say it nicely) no matter what vendor you're talking to.
> > > > 
> > > > Hi Christoph,
> > > > 
> > > > If there is a significant number of 4 KiB writes in a workload (e.g.
> > > > filesystem metadata writes), and the logical block size is increased from
> > > > 4 KiB to 16 KiB, this will increase write amplification no matter how the
> > > > SSD storage controller has been designed, isn't it? Is there perhaps
> > > > something that I'm misunderstanding?
> > > 
> > > You're misunderstanding that it's the _drive_ which gets to decide the
> > > logical block size. Filesystems literally can't do 4kB writes to these
> > > drives; you can't do a write smaller than a block.  If your clients
> > > don't think it's a good tradeoff for them, they won't tell Linux that
> > > the minimum IO size is 16kB.
> > 
> > Yes, but it's perhaps good to review how flexible this might or might
> > not be. I can at least mention what I know of for NVMe. Getting a lay
> > of the land of this for other storage media would be good.
> > 
> > Some of the large capacity NVMe drives have NPWG as 16k; that just means
> > the Indirection Unit, the mapping table, is 16k, so the drive is hinting
> > *we prefer 16k*, but you can still do 4k writes; it just means that on
> > all these drives a 4k write will be an RMW.
> 
> That's just a 4kb logical sector, 16kB physical sector block device,
> yes?

Yes.

> Maybe I'm missing something, but we already handle cases like that
> just fine thanks to all the work that went into supporting 512e
> devices...

Nothing new, it is just that for QLC drives with a 16k mapping table
a 4k write is internally an RMW (the drive reads the 16k IU, merges the
4k, and writes the full 16k back).

> > Users who *want* to help avoid RMWs on these drives and want to increase
> > the writes to be at least 16k can enable a 16k or larger block size so as
> > to align the writes. The experimentation we have done using Daniel Gomez's
> > eBPF blkalgn tool [0] revealed (as discussed at last year's Plumbers) that
> > there were still some 4k writes; this was in turn determined to be due
> > to XFS's buffer cache usage for metadata.
> 
> As I've explained several times, XFS AG headers are sector sized
> metadata. If you are exposing a 4kB logical sector size on a 16kB
> physical sector device, this is what you'll get. It's been that way
> with 512e devices for a long time, yes?

Sure!

> Also, direct IO will allow sector sized user data IOs, too, so it's
> not just XFS metadata that will be issuing 4kB IO in this case...

Yup..

> > Dave recently posted patches to allow
> > the use of large folios in the xfs buffer cache [1],
> 
> This has nothing to do with supporting large block sizes - it's
> purely an internal optimisation to reduce the amount of vmap
> (vmalloc) work we have to do for buffers that are larger than
> PAGE_SIZE on 4kB block size filesystems.

Oh sure, but I'm suggesting that for drives without the large atomic it
should still help to have this, as it would mean fewer small unaligned
writes.

> > For large capacity NVMe drives with large atomics (NAWUPF), the
> > nvme block driver will allow for the physical block size to be 16k too,
> > thus allowing the sector size to be set to 16k when creating the
> > filesystem; that would *optionally* allow users to force the
> > filesystem so that no writes to the device are 4k.
> 
> Just present it as a 16kB logical/physical sector block device. Then
> userspace and the filesystem will magically just do the right thing.

That seems sensible to me, I just wonder if there are some use
cases for users who want to opt in to the pain and accept
the 4k writes. It would be silly, but alas possible.

After thinking about this a bit, I don't think the pain of flexibility
is worth it. All userspace applications looking to do correct alignment
will use the logical block size, and if we keep that at 4k but expect
them to use only the physical block size, it's just asking for pain.

> We've already solved these problems, yes?

I agree, I figured the above might need some discussion.

> > Note
> > then that there are two ways to use a sector size of 16k for NVMe
> > today: one is if your drive supports a 16k LBA format, and
> > another is with these two parameters set to 16k. The latter allows you
> > to stick with a 512 byte or 4k LBA format and still use a 16k sector size.
> > That allows you to remain backward compatible.
> 
> Yes, that's an emulated small logical sector size block device.
> We've been supporting this for years - how are these NVMe drives in
> any way different? Configure the drive this way, it presents as a
> 512e or 4096e device, not a 16kB sector size device, yes?

Yup.

> > Jan Kara's patches "block: Add config option to not allow writing to
> > mounted devices" [2] should allow us to remove the set_blocksize() call
> > in xfs_setsize_buftarg() since XFS does not use the block device cache
> > at all, and his patches ensure that once a filesystem is mounted, userspace
> > won't muck with the block device directly.
> 
> That patch is completely irrelevant to how the block device presents
> sector sizes to userspace and the filesystem. It's also completely
> irrelevant to large block size support in filesystems. Why do you
> think it is relevant at all?

Today's set_blocksize() call from xfs_setsize_buftarg() limits the
block size that can be set on the block device cache, i.e., it keeps the
sector size from being lifted there. Removing it would help allow us to
extend the block device cache to use sector sizes > 4k; it is just one
small step in that direction. The other step is, as you have suggested
before, to enhance the block device cache so that we always use iomap
aops and switch from iomap page state to buffer heads in the bdev mapping
interface via a synchronised invalidation + setting/clearing of
IOMAP_F_BUFFER_HEAD in all new mapping requests [0]: that is, to
implement support for buffer_heads through the existing iomap
infrastructure.

A second consideration I had was whether we wanted the flexibility to
let a 16k-atomic-capable drive allow 4k writes even though it also
prefers 16k, but I think that leads to madness. I am not sure we
want to allow a 4k write on those drives just because it's possible
through any new means.

> I'm not sure exactly what is being argued about here, but if the
> large sector size support requires filesystem utilities to treat
> 4096e NVMe devices differently to existing 512e devices then the
> large sector size support stuff has gone completely off the rails.

It is not.

> We already have all the mechanisms needed for optimising layouts for
> large physical sector sizes w/ small emulated sector sizes and we
> have widespread userspace support for that. If this new large block
> device sector stuff doesn't work the same way, then you need to go
> back to the drawing board and make it work transparently with all
> the existing userspace infrastructure....

The only thing left worth discussing, I think, is whether we want to let
users opt in to a 4k sector size on a drive which allows 16k atomics and
prefers 16k, for instance...

My current thinking is that we just stick to a 16k logical block size for
those drives. But I welcome further arguments against that.

  Luis

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Large block for I/O
  2024-02-26 15:25                   ` Luis Chamberlain
@ 2024-03-07  1:59                     ` Luis Chamberlain
  2024-03-07  5:31                       ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Luis Chamberlain @ 2024-03-07  1:59 UTC (permalink / raw)
  To: Dave Chinner, kbus >> Keith Busch, NeilBrown, Tso Ted
  Cc: Matthew Wilcox, Daniel Gomez, Pankaj Raghav, Jan Kara,
	Bart Van Assche, Christoph Hellwig, Hannes Reinecke,
	Javier González, lsf-pc, linux-mm, linux-block, linux-scsi,
	linux-nvme

On Mon, Feb 26, 2024 at 07:25:23AM -0800, Luis Chamberlain wrote:
> The only thing left worth discussing, I think, is whether we want to let
> users opt in to a 4k sector size on a drive which allows 16k atomics and
> prefers 16k, for instance...

Thinking about this again, I get a sense things are OK as-is, but
let's review. I'd also like to clarify that these drives keep a 4k
LBA format. The only thing that changes is an increase in the IU
and the large atomic. The nvme driver today sets the physical block size
to the min of both.

It is similar to a drive today with a logical block size of 512 but a
physical block size of 4k. That allows you to specify a larger sector
size to create the filesystem.

After reviewing again our current language for the sysfs parameters in
Documentation/ABI/stable/sysfs-block for logical and physical block size,
and how nvme_update_disk_info() deals with NPWG (the IU used), NAWUPF
(namespace atomic) and NOWS (optimal write size), we seem to be using
them appropriately already. However, I think the language used in the
original commit c72758f33784e ("block: Export I/O topology for block
devices and partitions") on May 22, 2009, which clearly outlined
the implications of a read-modify-write, makes it even clearer.
A later commit 7e5f5fb09e6fc ("block: Update topology documentation")
three months later updated the documentation to remove the
read-modify-write language in favor of "atomic".

Even though we'd like to believe that userspace is doing the right thing
today, I think it would be good to review this and ensure we're
doing the right thing to make things transparent.

We have two types of large capacity NVMe drives to consider. As far as I
can tell all drives will always support a 512 or 4k LBA format, which
controls the logical block size, so they always remain backward
compatible. It would be up to users to format the drives to 512 or 4k.

One type of drive is without the large NAWUPF (atomic), and another with it.
Both will have a large NPWG (the IU). The NPWG is used to set minimum_io_size,
and so, using the original commit's language for minimum_io_size, direct IO
would benefit most from relying on that as a minimum.

At least Red Hat's documentation [0] about this suggests that minimum_io_size
will be read by userspace, but for direct IO it says that
direct IO must be aligned to *multiples of the logical block size*.
That does not clarify to me, however, whether the minimum IO used in userspace
today for direct IO relies on minimum_io_size. If it does, then things
will work optimally for these drives already.
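
For reference, the values userspace would be consulting here are the ones
the block layer already exports; again the device name is a placeholder:

cat /sys/block/nvme0n1/queue/logical_block_size
cat /sys/block/nvme0n1/queue/physical_block_size
cat /sys/block/nvme0n1/queue/minimum_io_size
cat /sys/block/nvme0n1/queue/optimal_io_size
lsblk -t /dev/nvme0n1   # LOG-SEC / PHY-SEC / MIN-IO / OPT-IO in one view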

When a large atomic (NAWUPF) is supported, the physical block size will be
lifted, and users can use that to create a filesystem with a sector
size larger than 4k. That would certainly help ensure that at least the
filesystem aligns all metadata and data to the large IU. After Jan
Kara's patches, which prevent writes to the block device once a
filesystem is mounted, userspace would not be allowed to muck
around with the block device, so userspace IO using the raw block device
with, say, a smaller logical sector size would not be allowed. Since
in these cases the sector size is set to a larger value for the filesystem,
direct IO on the filesystem should respect that preferred larger sector size.
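
As a sketch of what that looks like on such a drive (assuming an
LBS-capable kernel and xfsprogs; the device node and mount point are
placeholders, and the 16k value is what a large-IU, large-atomic drive
would advertise as its physical block size):

# physical_block_size is lifted to 16k by the large atomic, so a 16k
# sector size can be used at mkfs time
mkfs.xfs -f -b size=16k -s size=16k /dev/nvme0n1
mount /dev/nvme0n1 /mnt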

If userspace is indeed already relying on minimum_io_size correctly, I am
not sure we need to make any change. Drives with a large NPWG would get
minimum_io_size set for them already. And a large atomic would just lift
the physical block size. So I don't think we need to force the logical
block size to be 16k if both NPWG and NAWUPF are 16k. *Iff*, however, we
feel we may want to help userspace further, I wonder if having the option to
lift the logical block size to the NPWG is desirable.

I did some testing with fio against a 4k physical virtio drive with a
512 byte logical block size, creating a 4k block size XFS filesystem
with a 4k sector size. fio seems to chug along happily if you issue writes
with -bs=512 and even -blockalign=512.

Using Daniel Gomez's ./tools/blkalgn.py tool I still see 512 byte IO
commands issued, and I don't think they failed. But this was against a
virtio drive for convenience. QEMU's NVMe emulation today doesn't let you
have a logical block size different from the physical block size, so you'd
need to do some odd hacks to test something similar and emulate a large
atomic.
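
For reference, the 512-logical/4k-physical virtio drive used below can be
modelled in QEMU with something like the following; the image path, the
device id and the rest of the command line are placeholders:

qemu-system-x86_64 ... \
    -drive file=vdh.img,if=none,id=disk-vdh,format=raw \
    -device virtio-blk-pci,drive=disk-vdh,logical_block_size=512,physical_block_size=4096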

root@frag ~/bcc (git::blkalgn)# cat /sys/block/vdh/queue/physical_block_size
4096
root@frag ~/bcc (git::blkalgn)# cat /sys/block/vdh/queue/logical_block_size
512

mkfs.xfs -f -b size=4k -s size=4k /dev/vdh

fio -name=ten-1g-per-thread --nrfiles=10 -bs=512 -ioengine=io_uring \
-direct=1 \
-blockalign=512 \
--group_reporting=1 --alloc-size=1048576 --filesize=8KiB \
--readwrite=write --fallocate=none --numjobs=$(nproc) --create_on_open=1 \
--directory=/mnt

root@frag ~/bcc (git::blkalgn)# ./tools/blkalgn.py -d vdh

     Block size          : count     distribution
         0 -> 1          : 4        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 79       |****                                    |
      1024 -> 2047       : 320      |********************                    |
      2048 -> 4095       : 638      |****************************************|
      4096 -> 8191       : 161      |**********                              |
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 1        |                                        |

     Algn size           : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 1196     |****************************************|

Userspace can still do silly things, but I expected in the above that
512 byte IOs would not be issued.

[0] https://access.redhat.com/articles/3911611

  Luis

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Large block for I/O
  2024-03-07  1:59                     ` Luis Chamberlain
@ 2024-03-07  5:31                       ` Dave Chinner
  2024-03-07  7:29                         ` Luis Chamberlain
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2024-03-07  5:31 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: kbus @pop.gmail.com>> Keith Busch, NeilBrown, Tso Ted,
	Matthew Wilcox, Daniel Gomez, Pankaj Raghav, Jan Kara,
	Bart Van Assche, Christoph Hellwig, Hannes Reinecke,
	Javier González, lsf-pc, linux-mm, linux-block, linux-scsi,
	linux-nvme

On Wed, Mar 06, 2024 at 05:59:01PM -0800, Luis Chamberlain wrote:
> On Mon, Feb 26, 2024 at 07:25:23AM -0800, Luis Chamberlain wrote:
> root@frag ~/bcc (git::blkalgn)# cat /sys/block/vdh/queue/physical_block_size
> 4096
> root@frag ~/bcc (git::blkalgn)# cat /sys/block/vdh/queue/logical_block_size
> 512

This device supports 512 byte aligned IOs.

> mkfs.xfs -f -b size=4k -s size=4k /dev/vdh

This sets the filesystem block size to 4k, and the smallest metadata
block size to 4kB (sector size). It does not force user data direct
IO alignment to be 4kB - that is determined by what the underlying
block device supports, not by what the filesystem block size or metadata
sector size is set to.

Sure, doing 512 byte aligned/sized IO to a 4kB sector size device
is not optimal. IO to the file will be completely serialised
because these are sub-fs-block DIO writes, but it does work because
the underlying device allows it. Nobody wanting a performant
application will want to do this, but there are cases where this
case fulfils important functional requirements.

e.g. fs tools and loop devices that use direct IO to access file
based filesystem images that have 512 byte sector size will just
work on such a fs and storage setup, even though the host filesystem
isn't configured to use 512 byte sector alignment directly
itself....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Large block for I/O
  2024-03-07  5:31                       ` Dave Chinner
@ 2024-03-07  7:29                         ` Luis Chamberlain
  0 siblings, 0 replies; 26+ messages in thread
From: Luis Chamberlain @ 2024-03-07  7:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: kbus @pop.gmail.com>> Keith Busch, NeilBrown, Tso Ted,
	Matthew Wilcox, Daniel Gomez, Pankaj Raghav, Jan Kara,
	Bart Van Assche, Christoph Hellwig, Hannes Reinecke,
	Javier González, lsf-pc, linux-mm, linux-block, linux-scsi,
	linux-nvme

Then it does seem we have everything we need already, and no changes
are needed, because these drives *do* support the logical block size
advertised; it's just not optimal.

On Thu, Mar 07, 2024 at 04:31:48PM +1100, Dave Chinner wrote:
> Sure, doing 512 byte aligned/sized IO to a 4kB sector size device
> is not optimal. IO to the file will be completely serialised
> because these are sub-fs-block DIO writes, but it does work because
> the underlying device allows it. Nobody wanting a performant
> application will want to do this,

It's a good time to ask, though, whether there may be users who want to
opt in to this sort of situation so that the logical block size is lifted
to prevent any smaller IOs. For NVMe drives it could be where the atomic >=
Indirection Unit (IU). This is applicable even today on a 4k IU drive with
4k atomic support. Who would want this? Since any IO issued to a drive
which is smaller than the IU implies an RMW, restricting the drive to only
IOs matching the IU could in theory improve endurance.

> but there are cases where this case fulfils important functional requirements.

Sure.

> e.g. fs tools and loop devices that use direct IO to access file
> based filesystem images that have 512 byte sector size will just
> work on such a fs and storage setup, even though the host filesystem
> isn't configured to use 512 byte sector alignment directly
> itself....

It would seem to cover quite a few things. This is useful, thanks.

  Luis

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2024-03-07  7:29 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <7970ad75-ca6a-34b9-43ea-c6f67fe6eae6@iogearbox.net>
2023-12-20 10:01 ` LSF/MM/BPF: 2024: Call for Proposals Daniel Borkmann
2023-12-20 15:03   ` [LSF/MM/BPF TOPIC] Large block for I/O Hannes Reinecke
2023-12-21 20:33     ` Bart Van Assche
2023-12-21 20:42       ` Matthew Wilcox
2023-12-21 21:00         ` Bart Van Assche
2023-12-22  5:09       ` Christoph Hellwig
2023-12-22  5:13       ` Matthew Wilcox
2023-12-22  5:37         ` Christoph Hellwig
2024-01-08 19:30           ` Bart Van Assche
2024-01-08 19:35             ` Matthew Wilcox
2024-02-22 18:45               ` Luis Chamberlain
2024-02-25 23:09                 ` Dave Chinner
2024-02-26 15:25                   ` Luis Chamberlain
2024-03-07  1:59                     ` Luis Chamberlain
2024-03-07  5:31                       ` Dave Chinner
2024-03-07  7:29                         ` Luis Chamberlain
2023-12-22  8:23       ` Viacheslav Dubeyko
2023-12-22 12:29         ` Hannes Reinecke
2023-12-22 13:29           ` Matthew Wilcox
2023-12-22 15:10         ` Keith Busch
2023-12-22 16:06           ` Matthew Wilcox
2023-12-25  8:55             ` Viacheslav Dubeyko
2023-12-25  8:12           ` Viacheslav Dubeyko
2024-02-23 16:41     ` Pankaj Raghav (Samsung)
2024-01-17 13:37   ` LSF/MM/BPF: 2024: Call for Proposals [Reminder] Daniel Borkmann
2024-02-14 13:03     ` LSF/MM/BPF: 2024: Call for Proposals [Final Reminder] Daniel Borkmann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).