* Sorting blocks in xfs_buf_delwri_submit_buffers() still necessary?
@ 2021-10-13 17:13 Holger Hoffstätte
  2021-10-13 20:57 ` Dave Chinner
  0 siblings, 1 reply; 3+ messages in thread
From: Holger Hoffstätte @ 2021-10-13 17:13 UTC (permalink / raw)
  To: linux-xfs

Hi,

Based on what's going on in blk-mq & NVMe land I thought I'd check if XFS still
sorts buffers before sending them down the pipe, and sure enough that still
happens in xfs_buf.c:xfs_buf_delwri_submit_buffers() (the comparison function
is directly above). Before I make a fool of myself and try to remove this,
do we still think this is necessary? If there's a scheduler it will do the
same thing, and SSD/NVMe might do the same in HW anyway or not care.
The only scenario I can think of where this might make a difference is
rotational RAID without scheduler attached. Not sure.

I'm looking forward to hearing what a foolish idea this is.

cheers
Holger


* Re: Sorting blocks in xfs_buf_delwri_submit_buffers() still necessary?
  2021-10-13 17:13 Sorting blocks in xfs_buf_delwri_submit_buffers() still necessary? Holger Hoffstätte
@ 2021-10-13 20:57 ` Dave Chinner
  2021-10-14  7:33   ` Holger Hoffstätte
  0 siblings, 1 reply; 3+ messages in thread
From: Dave Chinner @ 2021-10-13 20:57 UTC (permalink / raw)
  To: Holger Hoffstätte; +Cc: linux-xfs

On Wed, Oct 13, 2021 at 07:13:10PM +0200, Holger Hoffstätte wrote:
> Hi,
> 
> Based on what's going on in blk-mq & NVMe land

What's going on in this area that is any different from the past few
years?

> I thought I'd check if XFS still
> sorts buffers before sending them down the pipe, and sure enough that still
> happens in xfs_buf.c:xfs_buf_delwri_submit_buffers() (the comparison function
> is directly above). Before I make a fool of myself and try to remove this,
> do we still think this is necessary?

Yes, I do.

A thought experiment for you, which you can then back up with actual
simulation with fio:

What is more efficient and faster at the hardware level: 16
individual sequential 4kB IOs or one 64kB IO?

Which of these uses less CPU to dispatch and complete?

Which has less IO in flight and so allows more concurrent IO to be
dispatched to the hardware at the same time?

Answering these questions will give you your answer.

So, play around with AIO to simulate xfs buffer IO - the xfs buffer
cache is really just a high concurrency async IO engine. Use fio to
submit a series of individual sequential 4kB AIO writes with a queue
depth of, say, 128 to a file and time it. Then submit the
same number of sequential 4kB AIO writes as batches of 64 IOs
at a time. Which one is faster? Why is it faster? You'll need to
play around with fio queue batching controls to do this, but you can
simulate it quite easily.
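A minimal fio job file along those lines might look like this - a sketch, not
a tuned benchmark; the filename and sizes are arbitrary, and
iodepth_batch_submit is the batching knob in question:

```ini
; delwri-sim.fio -- individual vs batched sequential 4kB AIO writes
[global]
ioengine=libaio
direct=1
rw=write
bs=4k
size=1g
iodepth=128
filename=/tmp/delwri-sim.dat

; submit one IO per io_submit() call
[individual]
iodepth_batch_submit=1

; wait for the first job to finish, then submit 64 IOs per call
[batched]
stonewall
iodepth_batch_submit=64
```

Running "fio delwri-sim.fio" then reports bandwidth and completion latency
for each job side by side.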

The individual sequential IO fio simulation is the equivalent of
eliding the buffer sort in xfs_buf_delwri_submit_buffers(), whilst
the batched submission is the equivalent of what we have now.

Just because your hardware can do a million IOPS, it doesn't mean
the most efficient way to do IO is to dispatch a million IOPS....

> If there's a scheduler it will do the
> same thing, and SSD/NVMe might do the same in HW anyway or not care.
> The only scenario I can think of where this might make a difference is
> rotational RAID without scheduler attached. Not sure.

Schedulers only have a reorder window of 100-200 individual IOs.
xfs_buf_delwri_submit_buffers() can be passed tens of thousands of
buffers in a single list for IO dispatch that need reordering.

IOWs, the sort+merge window for the number of IOs we dispatch from
metadata writeback is often orders of magnitude larger than what the
block layer scheduler can optimise effectively. The IO stack
behaviour is largely GIGO, so anything we can do at a higher layer
to make the IO submission less garbage-like results in improved
throughput through the software and hardware layers of the storage
stack.

> I'm looking forward to hearing what a foolish idea this is.

If the list_sort() is not showing up in profiles, then it is
essentially free. Last time I checked, our biggest overhead was the
CPU overhead of flushing a million inodes/s from the AIL to their
backing buffers - the list_sort() didn't even show up in the top 50
functions in the profile...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: Sorting blocks in xfs_buf_delwri_submit_buffers() still necessary?
  2021-10-13 20:57 ` Dave Chinner
@ 2021-10-14  7:33   ` Holger Hoffstätte
  0 siblings, 0 replies; 3+ messages in thread
From: Holger Hoffstätte @ 2021-10-14  7:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs


On Thu, 14 Oct 2021, Dave Chinner wrote:

> On Wed, Oct 13, 2021 at 07:13:10PM +0200, Holger Hoffstätte wrote:
>> Hi,
>>
>> Based on what's going on in blk-mq & NVMe land
>
> What's going on in this area that is any different from the past few
> years?

Nothing in particular, just watching Jens pull out all the stops is
interesting, and all sorts of other overheads are peeking out from
under the couch.

>> I thought I'd check if XFS still
>> sorts buffers before sending them down the pipe, and sure enough that still
>> happens in xfs_buf.c:xfs_buf_delwri_submit_buffers() (the comparison function
>> is directly above). Before I make a fool of myself and try to remove this,
>> do we still think this is necessary?
>
> Yes, I do.

Ok - I completely forgot about merging adjacent requests, and that
only works if they are somehow sorted. Makes sense.

Thank you for the explanation!

Holger


