linux-fsdevel.vger.kernel.org archive mirror
* [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
@ 2024-02-23 23:59 Luis Chamberlain
  2024-02-24  4:12 ` Matthew Wilcox
                   ` (3 more replies)
  0 siblings, 4 replies; 87+ messages in thread
From: Luis Chamberlain @ 2024-02-23 23:59 UTC (permalink / raw)
  To: lsf-pc
  Cc: linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe,
	Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner,
	Matthew Wilcox, Linus Torvalds, mcgrof

Part of the testing we have done with LBS was to run some performance
tests on XFS to ensure things are not regressing. Building Linux is a
decent test and we did some random cloud instance tests with that and
presented those at Plumbers, but it doesn't really cut it if we want to
push things to the limit. What are the limits of buffered IO
and how do we test them? Who keeps track of it?

The obvious recurring tension is that for really high performance folks
just recommend using direct IO. But if you are stress testing changes
to a filesystem and want to push buffered IO to its limits it makes
sense to stick to buffered IO, otherwise how else do we test it?

It is good to know the limits of buffered IO too because some workloads
cannot use direct IO.  For instance PostgreSQL doesn't have direct IO
support, and even as late as the end of last year we learned that adding
direct IO to PostgreSQL would be difficult.  Chris Mason has also noted
that direct IO can force writes during reads (?)... Anyway, testing
the limits of buffered IO to ensure you are not creating
regressions when doing some page cache surgery seems like it might be
useful and a sensible thing to do. The good news is we have not found
regressions with LBS, but all the testing seems to beg the question: what
are the limits of buffered IO anyway, and how does it scale? Do we know, do
we care? Do we keep track of it? How does it compare to direct IO for some
workloads? How big is the delta? How do we best test that? How do we
automate all that? Do we want to automatically test this to avoid regressions?

The obvious issue with some workloads for buffered IO is the possible
penalty if you are not really re-using folios added to the
page cache. Jens Axboe reported a while ago issues with workloads doing
random reads over a data set 10x the size of RAM and also proposed
RWF_UNCACHED as a way to help [0][1]. As Dave Chinner put it, this seemed more
like direct IO with kernel pages and a memcpy(), and it would require
implementing further serialization of the sort we already do for
direct IO writes. There at least seems to be agreement that if we're
going to provide an enhancement or alternative we should strive to not
make the same mistakes we've made with direct IO. The rationale for some
workloads to use buffered IO is that it helps reduce some tail latencies, so
that's something to live up to.
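
For context, the proposal in [0][1] exposed this as an RWF_UNCACHED flag to
preadv2()/pwritev2(). A minimal sketch of what a write would look like,
assuming a kernel and headers carrying those patches (RWF_UNCACHED is not in
the mainline kernels discussed here, and write_uncached() is just a made-up
helper name):

  #define _GNU_SOURCE
  #include <sys/uio.h>

  #ifndef RWF_UNCACHED
  #error "RWF_UNCACHED comes from the patched uapi headers in [1], not mainline"
  #endif

  /* Buffered write that asks the kernel not to leave the data in the
   * page cache once it has been written, per the proposed semantics. */
  static ssize_t write_uncached(int fd, const void *buf, size_t len, off_t off)
  {
          struct iovec iov = {
                  .iov_base = (void *)buf,
                  .iov_len  = len,
          };

          return pwritev2(fd, &iov, 1, off, RWF_UNCACHED);
  }

On an unpatched kernel the #error makes the assumption explicit instead of
silently falling back to a plain buffered write.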

On that same thread Christoph also mentioned the possibility of a direct
IO variant which can leverage the cache. Is that something we want to
move forward with?

Chris Mason also listed a few other desirables if we do:

- Allowing concurrent writes (xfs DIO does this now)
- Optionally doing zero copy if alignment is good (btrfs DIO does this now)
- Optionally tossing pages at the end (requires a separate syscall now; see the sketch below)
- Supporting aio via io_uring
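
On the "tossing pages at the end" item, the separate syscall today is
presumably posix_fadvise(POSIX_FADV_DONTNEED) once the data is clean. A
minimal sketch (the toss_cached_pages() helper name is made up):

  #include <fcntl.h>
  #include <unistd.h>

  /*
   * After writing a file through the page cache, make sure the pages are
   * clean and then drop them, since we don't expect to re-read them.
   */
  static int toss_cached_pages(int fd)
  {
          if (fsync(fd) < 0)
                  return -1;

          /* offset 0, len 0 means "the whole file" */
          return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
  }

The fsync() comes first because dirty pages are not dropped by
POSIX_FADV_DONTNEED.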

I recently ran a different type of simple test, focused on sequential writes
to fill capacity, with the amount written essentially matching the system's
RAM. Technically, at the max size I tested the writes were just *slightly*
over RAM, but that's a minor technicality given I did other tests with
similar sizes which showed similar results... This test should be easy to
reproduce if you have more than enough RAM to spare. In this case the system
has 1 TiB of RAM, using pmem to avoid drive variance / GC / other drive
shenanigans.

So pmem grub setup:

memmap=500G!4G memmap=3G!504G

As noted earlier, surely DIO / DAX is best for pmem (and I actually get
a difference between using just DIO and DAX, but I digress), but
when one wants to test buffered IO on purpose it makes sense to do
this. Yes, we can test tmpfs too... but I believe that topic will be
brought up at LSFMM separately.  The delta between DIO and buffered IO on
XFS is astronomical:

  ~86 GiB/s on pmem DIO on xfs with 64k block size, 1024 XFS agcount on x86_64
     Vs
 ~ 7,000 MiB/s with buffered IO

Let's recap: the system had 1 TiB RAM, and about half was dedicated to
pmem. It used two pmem partitions: one 500 GiB for XFS and another 3 GiB
for an XFS external journal. The system has DDR5, which caps out at
35.76 GiB/s. The proprietary Intel tool to test RAM crashes on me... and
leaves a kernel splat that seems to indicate the tool is doing something
we don't allow anymore. So I couldn't get an aggregate baseline on
just RAM. The kernel used was 6.6.0-rc5 + our LBS changes on top.

The buffered IO fio incantation:

fio -name=ten-1g-per-thread --nrfiles=10 -bs=2M -ioengine=io_uring \
    -direct=0 --group_reporting=1 --alloc-size=1048576 --filesize=1GiB \
    --readwrite=write --fallocate=none --numjobs=$(nproc) \
    --create_on_open=1 --directory=/mnt

This uses a 1 GiB file size and scales up to nproc (128) threads; each
thread writes 1 GiB of data at a time, 10 times, so 10 files of 1 GiB per
thread. That easily fills the filesystem until we're out of space. Each
thread ends up writing only about ~300 - 400 MiB to each of its 10 files,
as we run out of space fast.

Use -direct=1 for direct IO and -direct=0 for buffered IO.

Perhaps this is a silly topic to bring up, I don't know, but the max
throughput wall on writes seemed strikingly low to me given the
amount of free memory on the system.

Reducing the data written to, say, just 128 GiB makes the throughput climb
to about ~13 GiB/s for buffered IO... but doubling that to 256 GiB gets us
only 9,276 MiB/s.  So even when we have twice the RAM available we hit
a throughput wall which is *still* astronomically different from direct
IO with this setup.

Should we talk about these things at LSFMM? Are folks interested?

[0] https://lwn.net/Articles/806980/
[1] https://patchwork.kernel.org/project/linux-fsdevel/cover/20191211152943.2933-1-axboe@kernel.dk/

  Luis


* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-23 23:59 [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO Luis Chamberlain
@ 2024-02-24  4:12 ` Matthew Wilcox
  2024-02-24 17:31   ` Linus Torvalds
  2024-02-24 17:55   ` Luis Chamberlain
  2024-02-25  5:24 ` Kent Overstreet
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 87+ messages in thread
From: Matthew Wilcox @ 2024-02-24  4:12 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner, Linus Torvalds

On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> Part of the testing we have done with LBS was to do some performance
> tests on XFS to ensure things are not regressing. Building linux is a
> fine decent test and we did some random cloud instance tests on that and
> presented that at Plumbers, but it doesn't really cut it if we want to
> push things to the limit though. What are the limits to buffered IO
> and how do we test that? Who keeps track of it?

TLDR: Why does the pagecache suck?

>   ~86 GiB/s on pmem DIO on xfs with 64k block size, 1024 XFS agcount on x86_64
>      Vs
>  ~ 7,000 MiB/s with buffered IO

Profile?  My guess is that you're bottlenecked on the xa_lock between
memory reclaim removing folios from the page cache and the various
threads adding folios to the page cache.

If each thread has its own file, that would help.  If the threads do
their own reclaim that would help the page cache ... but then they'd
contend on the node's lru lock instead, so just trading one pain for
another.


* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-24  4:12 ` Matthew Wilcox
@ 2024-02-24 17:31   ` Linus Torvalds
  2024-02-24 18:13     ` Matthew Wilcox
                       ` (2 more replies)
  2024-02-24 17:55   ` Luis Chamberlain
  1 sibling, 3 replies; 87+ messages in thread
From: Linus Torvalds @ 2024-02-24 17:31 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez,
	Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig,
	Chris Mason, Johannes Weiner

On Fri, 23 Feb 2024 at 20:12, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> >  What are the limits to buffered IO
> > and how do we test that? Who keeps track of it?
>
> TLDR: Why does the pagecache suck?

What? No.

Our page cache is so good that the question is literally "what are the
limits of it", and "how we would measure them".

That's not a sign of suckage.

When you have to have completely unrealistic loads that nobody would
actually care about in reality just to get a number for the limit,
it's not a sign of problems.

Or rather, the "problem" is the person looking at a stupid load, and
going "we need to improve this because I can write a benchmark for
this".

Here's a clue: a hardware discussion forum I visit was arguing about
memory latencies, and talking about how their measured overhead of
DRAM latency was literally 85% on the CPU side, not the DRAM side.

Guess what? It's because the CPU in question had quite a bit of L3,
and it was spread out, and the CPU doesn't even start the memory
access before it has checked caches.

And here's a big honking clue: only a complete nincompoop and mentally
deficient rodent would look at that and say "caches suck".

> >  ~86 GiB/s on pmem DIO on xfs with 64k block size, 1024 XFS agcount on x86_64
> >      Vs
> >  ~ 7,000 MiB/s with buffered IO
>
> Profile?  My guess is that you're bottlenecked on the xa_lock between
> memory reclaim removing folios from the page cache and the various
> threads adding folios to the page cache.

I doubt it's the locking.

In fact, for writeout in particular it's probably not even the page
cache at all.

For writeout, we have a very traditional problem: we care a
million times more about latency than we care about throughput,
because nobody ever actually cares all that much about performance of
huge writes.

Ask yourself when you have last *really* sat there waiting for writes,
unless it's some dog-slow USB device that writes at 100kB/s?

The main situation where people care about cached write performance
(ignoring silly benchmarks) tends to be when you create files, and the
directory entry ordering means that the bottleneck is a number of
small writes and their *ordering* and their latency.

And then the issue is basically never the page cache, but the
filesystem ordering of the metadata writes against each other and
against the page writeout.

Why? Because on all but a *minuscule* percentage of loads, all the
actual data writes are quite gracefully taken by the page cache
completely asynchronously, and nobody ever cares about the writeout
latencies.

Now, the benchmark that Luis highlighted is a completely different
class of historical problems that has been around forever, namely the
"fill up lots of memory with dirty data".

And there - because the problem is easy to trigger but nobody tends to
care deeply about throughput because they care much much *MUCH* more
about latency, we have a rather stupid big hammer approach.

It's called "vm_dirty_bytes".

Well, that's the knob (not the only one). The actual logic around it
is then quite the morass of turning that into the
dirty_throttle_control, and the per-bdi dirty limits that try to take
the throughput of the backing device into account etc etc.
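
For reference, these knobs are exposed under /proc/sys/vm/; a trivial,
self-contained way to dump the current values (nothing here beyond the
standard procfs files):

  #include <stdio.h>

  int main(void)
  {
          /* The writeback dirty-limit knobs discussed here. */
          static const char *knobs[] = {
                  "/proc/sys/vm/dirty_bytes",
                  "/proc/sys/vm/dirty_background_bytes",
                  "/proc/sys/vm/dirty_ratio",
                  "/proc/sys/vm/dirty_background_ratio",
          };
          char line[64];

          for (unsigned int i = 0; i < sizeof(knobs) / sizeof(knobs[0]); i++) {
                  FILE *f = fopen(knobs[i], "r");

                  if (f && fgets(line, sizeof(line), f))
                          printf("%s: %s", knobs[i], line);
                  if (f)
                          fclose(f);
          }
          return 0;
  }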

And then all those heuristics are used to actually LITERALLY PAUSE the
writer. We literally have this code:

                __set_current_state(TASK_KILLABLE);
                bdi->last_bdp_sleep = jiffies;
                io_schedule_timeout(pause);

in balance_dirty_pages(), which is all about saying "I'm putting you
to sleep, because I judge you to have dirtied so much memory that
you're making things worse for others".

And a lot of *that* is then because we haven't wanted everybody to
rush in and start their own synchronous writeback, but instead want
all writeback to be done by somebody else. So now we move from
mm/page-writeback.c to fs/fs-writeback.c, and all the work-queues to
do dirty writeout.

Notice how the io_schedule_timeout() above doesn't even get woken up
by IO completing. Nope. The "you have written too much" logic
literally pauses the writer, and doesn't even want to wake it up when
there is no more dirty data.

So the "you went over the dirty limits It's a penalty box, and all of
this comes from "you are doing something that is abnormal and that
disturbs other people, so you get an unconditional penalty". Yes, the
timeout is then obviously tied to how much of a problem the dirtying
is (based on that whole "how fast is the device") but it's purely a
heuristic.

And (one) important part here is "nobody sane does that".  So
benchmarking this is a bit crazy. The code is literally meant for bad
actors, and what you are benchmarking is the kernel telling you "don't
do that then".

And absolutely *NONE* of this all has anything to do with the page cache. NADA.

And yes, there's literally thousands of lines of code all explicitly
designed for this "slow down writers" and make it be at least somewhat
graceful and gradual.

That's pretty much all mm/page-writeback.c does (yes, that file *also*
does have the "start/end folio writeback" functions, but they are only
a small part of it, even if that's obviously the origin of the file -
the writeback throttling logic has just grown a lot more).

               Linus


* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-24  4:12 ` Matthew Wilcox
  2024-02-24 17:31   ` Linus Torvalds
@ 2024-02-24 17:55   ` Luis Chamberlain
  1 sibling, 0 replies; 87+ messages in thread
From: Luis Chamberlain @ 2024-02-24 17:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner, Linus Torvalds

On Sat, Feb 24, 2024 at 04:12:31AM +0000, Matthew Wilcox wrote:
> On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> >   ~86 GiB/s on pmem DIO on xfs with 64k block size, 1024 XFS agcount on x86_64
> >      Vs
> >  ~ 7,000 MiB/s with buffered IO
> 
> Profile?  My guess is that you're bottlenecked on the xa_lock between
> memory reclaim removing folios from the page cache and the various
> threads adding folios to the page cache.

If it was lock contention, I was hoping to catch it by running perf lock record on fio, then:
perf lock report -F acquired,contended,avg_wait,wait_total

If the contention was on the xa_lock, it would creep up here, no?

                Name   acquired  contended     avg wait   total wait 

   cgroup_rstat_lock      90132      90132     26.41 us      2.38 s  
         event_mutex      32538      32538      1.40 ms     45.61 s  
                          23476      23476    123.48 us      2.90 s  
                          20803      20803     47.58 us    989.73 ms 
                          11921      11921     31.19 us    371.82 ms 
                           9389       9389    102.65 us    963.80 ms 
                           7763       7763     21.86 us    169.69 ms 
                           1736       1736     15.49 us     26.89 ms 
                            743        743    308.30 us    229.07 ms 
                            667        667    269.69 us    179.88 ms 
                            522        522     36.64 us     19.13 ms 
                            335        335     19.38 us      6.49 ms 
                            328        328    157.10 us     51.53 ms 
                            296        296    278.22 us     82.35 ms 
                            288        288    214.82 us     61.87 ms 
                            282        282    314.38 us     88.65 ms 
                            275        275    128.98 us     35.47 ms 
                            269        269    141.99 us     38.19 ms 
                            264        264    277.73 us     73.32 ms 
                            260        260    160.02 us     41.61 ms 
         event_mutex        251        251    242.03 us     60.75 ms 
                            248        248     12.47 us      3.09 ms 
                            246        246    328.33 us     80.77 ms 
                            245        245    189.83 us     46.51 ms 
                            245        245    275.17 us     67.42 ms 
                            235        235    152.49 us     35.84 ms 
                            235        235     38.55 us      9.06 ms 
                            228        228    137.27 us     31.30 ms 
                            224        224     94.65 us     21.20 ms 
                            221        221    198.13 us     43.79 ms 
                            220        220    411.64 us     90.56 ms 
                            214        214    291.08 us     62.29 ms 
                            209        209    132.94 us     27.79 ms 
                            207        207    364.20 us     75.39 ms 
                            204        204    346.68 us     70.72 ms 
                            194        194    169.77 us     32.94 ms 
                            181        181    137.87 us     24.95 ms 
                            181        181    154.78 us     28.01 ms 
                            172        172    145.11 us     24.96 ms 
                            169        169    124.30 us     21.01 ms 
                            168        168    378.92 us     63.66 ms 
                            161        161     91.64 us     14.75 ms 
                            161        161    264.51 us     42.59 ms 
                            153        153     85.53 us     13.09 ms 
                            150        150    383.28 us     57.49 ms 
                            148        148     91.24 us     13.50 ms 

I'll have to nose-dive some more... but for the life of me I can't see
the expected xa_lock contention.

  Luis


* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-24 17:31   ` Linus Torvalds
@ 2024-02-24 18:13     ` Matthew Wilcox
  2024-02-24 18:24       ` Linus Torvalds
  2024-02-24 18:20     ` Linus Torvalds
  2024-02-25  5:18     ` Kent Overstreet
  2 siblings, 1 reply; 87+ messages in thread
From: Matthew Wilcox @ 2024-02-24 18:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez,
	Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig,
	Chris Mason, Johannes Weiner

On Sat, Feb 24, 2024 at 09:31:44AM -0800, Linus Torvalds wrote:
> On Fri, 23 Feb 2024 at 20:12, Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> > >  What are the limits to buffered IO
> > > and how do we test that? Who keeps track of it?
> >
> > TLDR: Why does the pagecache suck?
> 
> What? No.
> 
> Our page cache is so good that the question is literally "what are the
> limits of it", and "how we would measure them".
> 
> That's not a sign of suckage.

For a lot of things our pagecache is amazing.  And in other ways it
absolutely sucks.  The trick will be fixing those less-used glass jaws
without damaging the common cases.

> When you have to have completely unrealistic loads that nobody would
> actually care about in reality just to get a number for the limit,
> it's not a sign of problems.

No, but sometimes the unrealistic loads are, alas, a good proxy for
problems that customers hit.  For example, I have one where the customer
does an overnight backup with a shitty backup program that doesn't use
O_DIRECT and ends up evicting the actual working set from the page
cache.  They start work the next morning with terrible performance
because everything they care about has been swapped out.  The "fix"
is to limit the pagecache to one NUMA node.  I suspect that if this customer
could be persuaded to run a more recent kernel, this problem would turn out
to have been solved already, so I'm not sure there's a call to action from
this particular case.

Anyway, if there's a way to fix an unrealistic load that doesn't affect
realistic loads, sometimes we fix a customer problem too.

> Guess what? It's because the CPU in question had quite a bit of L3,
> and it was spread out, and the CPU doesn't even start the memory
> access before it has checked caches.
> 
> And here's a big honking clue: only a complete nincompoop and mentally
> deficient rodent would look at that and say "caches suck".

although the problem might be that the CPU has a terrible cross-chiplet
interconnect ;-)

> > >  ~86 GiB/s on pmem DIO on xfs with 64k block size, 1024 XFS agcount on x86_64
> > >      Vs
> > >  ~ 7,000 MiB/s with buffered IO
> >
> > Profile?  My guess is that you're bottlenecked on the xa_lock between
> > memory reclaim removing folios from the page cache and the various
> > threads adding folios to the page cache.
> 
> I doubt it's the locking.

It might not be!  But there are read-only workloads that do bottleneck
on the xa_lock.

> For writeout, we have a very traditional problem: we care about a
> million times more about latency than we care about throughput,
> because nobody ever actually cares all that much about performance of
> huge writes.
> 
> Ask yourself when you have last *really* sat there waiting for writes,
> unless it's some dog-slow USB device that writes at 100kB/s?

You picked a bad day to send this email ;-)

$ sudo dd if=Downloads/debian-testing-amd64-netinst.iso of=/dev/sda
[sudo] password for willy:
1366016+0 records in
1366016+0 records out
699400192 bytes (699 MB, 667 MiB) copied, 296.219 s, 2.4 MB/s

ok, that was a cheap-arse USB stick, but then I had to wait for the
installer to write 800GB of random data to /dev/nvme0n1p3 as it set up
the crypto drive.  The Debian installer didn't time that for me, but it
was enough time to vacuum the couch.

> Now, the benchmark that Luis highlighted is a completely different
> class of historical problems that has been around forever, namely the
> "fill up lots of memory with dirty data".
> 
> And there - because the problem is easy to trigger but nobody tends to
> care deeply about throughput because they care much much *MUCH* more
> about latency, we have a rather stupid big hammer approach.
> 
> It's called "vm_dirty_bytes".
> 
> Well, that's the knob (not the only one). The actual logic around it
> is then quite the morass of turning that into the
> dirty_throttle_control, and the per-bdi dirty limits that try to take
> the throughput of the backing device into account etc etc.
> 
> And then all those heuristics are used to actually LITERALLY PAUSE the
> writer. We literally have this code:
> 
>                 __set_current_state(TASK_KILLABLE);
>                 bdi->last_bdp_sleep = jiffies;
>                 io_schedule_timeout(pause);
> 
> in balance_dirty_pages(), which is all about saying "I'm putting you
> to sleep, because I judge you to have dirtied so much memory that
> you're making things worse for others".
> 
> And a lot of *that* is then because we haven't wanted everybody to
> rush in and start their own synchronous writeback, but instead want
> all writeback to be done by somebody else. So now we move from
> mm/page-writeback.c to fs/fs-writeback.c, and all the work-queues to
> do dirty writeout.
> 
> Notice how the io_schedule_timeout() above doesn't even get woken up
> by IO completing. Nope. The "you have written too much" logic
> literally pauses the writer, and doesn't even want to wake it up when
> there is no more dirty data.
> 
> So the "you went over the dirty limits It's a penalty box, and all of
> this comes from "you are doing something that is abnormal and that
> disturbs other people, so you get an unconditional penalty". Yes, the
> timeout is then obviously tied to how much of a problem the dirtying
> is (based on that whole "how fast is the device") but it's purely a
> heuristic.
> 
> And (one) important part here is "nobody sane does that".  So
> benchmarking this is a bit crazy. The code is literally meant for bad
> actors, and what you are benchmarking is the kernel telling you "don't
> do that then".

This was a really good writeup, thanks.  I think this might need some
tuning (if it is what's going on).  When the underlying device can do
86GB/s and we're only getting 7GB/s, we could afford to let this writer
do a bit more.



* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-24 17:31   ` Linus Torvalds
  2024-02-24 18:13     ` Matthew Wilcox
@ 2024-02-24 18:20     ` Linus Torvalds
  2024-02-24 19:11       ` Linus Torvalds
  2024-02-25  5:18     ` Kent Overstreet
  2 siblings, 1 reply; 87+ messages in thread
From: Linus Torvalds @ 2024-02-24 18:20 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez,
	Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig,
	Chris Mason, Johannes Weiner

On Sat, 24 Feb 2024 at 09:31, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And (one) important part here is "nobody sane does that".  So
> benchmarking this is a bit crazy. The code is literally meant for bad
> actors, and what you are benchmarking is the kernel telling you "don't
> do that then".

Side note: one reason why the big hammer approach of "don't do that"
has worked so well is that the few loads that *do* want to do this and
have a valid reason to write large amounts of data in one go are
generally trivially translated to O_DIRECT.

For example, if you actually do things like write disk images etc,
O_DIRECT is lovely and easy - even trivial - to use. You don't even
have to write code for it, you can (and people do) just use 'dd' with
'oflag=direct'. So even trivial shell scripting has access to the
"don't do that then" flag.

In other words, I really think that Luis' benchmark triggers that
kernel "you are doing something actively wrong and stupid" logic. It's
not the kernel trying to optimize writeback. It's the kernel trying to
protect others from stupid loads.

Now, I'm also not saying that you should benchmark this with our
"vm_dirty_bytes" logic disabled. That may indeed help performance on
that benchmark, but you'll just hit other problem spots instead. Once
you fill up lots of memory, other problems become really big and
nasty, so you would then need *other* fixes for those issues.

If somebody really cares about this kind of load, and cannot use
O_DIRECT for some reason ("I actually do want caches 99% of the
time"), I suspect the solution is to have some slightly gentler way to
say "instead of the throttling logic, I want you to start my writeouts
much more synchronously".

IOW, we could have a writer flag that still uses the page cache, but
where instead of that

                balance_dirty_pages_ratelimited(mapping);

in generic_perform_write(), it would actually synchronously *start*
the write. That might work a whole lot better for any load that still
wants to do big streaming writes, but wants to also keep the page
cache component.

                Linus


* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-24 18:13     ` Matthew Wilcox
@ 2024-02-24 18:24       ` Linus Torvalds
  0 siblings, 0 replies; 87+ messages in thread
From: Linus Torvalds @ 2024-02-24 18:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez,
	Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig,
	Chris Mason, Johannes Weiner

On Sat, 24 Feb 2024 at 10:13, Matthew Wilcox <willy@infradead.org> wrote:
>
> > Ask yourself when you have last *really* sat there waiting for writes,
> > unless it's some dog-slow USB device that writes at 100kB/s?
>
> You picked a bad day to send this email ;-)
>
> $ sudo dd if=Downloads/debian-testing-amd64-netinst.iso of=/dev/sda

What? No. I literally picked the example you then did.

And you in fact were being clueless, and the kernel damn well should
tell you that you are being an ass, and should slow your writes down
so that you don't hurt others.

Which is what it did.

You should have damn well used oflag=direct, and if you didn't do
that, then the onus is all on YOU.

The kernel really isn't there to fix your stupid errors for you. You
do something stupid, the kernel will give you rope and happily say "do
you want more?"

                 Linus


* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-24 18:20     ` Linus Torvalds
@ 2024-02-24 19:11       ` Linus Torvalds
  2024-02-24 21:42         ` Theodore Ts'o
  2024-02-24 22:57         ` Chris Mason
  0 siblings, 2 replies; 87+ messages in thread
From: Linus Torvalds @ 2024-02-24 19:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez,
	Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig,
	Chris Mason, Johannes Weiner

On Sat, 24 Feb 2024 at 10:20, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> If somebody really cares about this kind of load, and cannot use
> O_DIRECT for some reason ("I actually do want caches 99% of the
> time"), I suspect the solution is to have some slightly gentler way to
> say "instead of the throttling logic, I want you to start my writeouts
> much more synchronously".
>
> IOW, we could have a writer flag that still uses the page cache, but
> that instead of that
>
>                 balance_dirty_pages_ratelimited(mapping);

I was *sure* we had had some work in this area, and yup, there's a
series from 2019 by Konstantin Khlebnikov to implement write-behind.

Some digging in the lore archives found this

    https://lore.kernel.org/lkml/156896493723.4334.13340481207144634918.stgit@buzz/

but I don't remember what then happened to it.  It clearly never went
anywhere, although I think something _like_ that is quite possibly the
right thing to do (and I was fairly positive about the patch at the
time).

I have this feeling that there's been other attempts of write-behind
in this area, but that thread was the only one I found from my quick
search.

I'm not saying Konstantin's patch is the thing to do, and I suspect we
might want to actually have some way for people to say at open-time
that "I want write-behind", but it looks like at least a starting
point.

But it is possible that this work never went anywhere exactly because
this is such a rare case. That kind of "write so much that you want to
do something special" is often such a special thing that using
O_DIRECT is generally the trivial solution.

                   Linus


* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-24 19:11       ` Linus Torvalds
@ 2024-02-24 21:42         ` Theodore Ts'o
  2024-02-24 22:57         ` Chris Mason
  1 sibling, 0 replies; 87+ messages in thread
From: Theodore Ts'o @ 2024-02-24 21:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Sat, Feb 24, 2024 at 11:11:28AM -0800, Linus Torvalds wrote:
> But it is possible that this work never went anywhere exactly because
> this is such a rare case. That kind of "write so much that you want to
> do something special" is often such a special thing that using
> O_DIRECT is generally the trivial solution.

Well, actually there's a relatively common workload where we do this
exact same thing --- and that's when we run mkfs.ext[234] / mke2fs.
We issue a huge number of buffered writes (at least, if the device
doesn't support a zeroing discard operation) to zero out the inode
table.  We rely on the mm subsystem putting mke2fs "into the penalty
box", or else some process (usually mke2fs) will get OOM-killed.

I don't consider it a "penalty" --- in fact, when write throttling
doesn't work, I've complained that it's an mm bug.  (Sometimes this
has broken when the mke2fs process runs out of physical memory, and
sometimes it has broken when the mke2fs runs into the memory cgroup
limit; it's one of those things that seems to break every 3-5
years.)  But still, it's something which *must* work, because it's
really not reasonable for userspace to know the right rate at which
to self-throttle buffered writes --- it's something the kernel
should do for the userspace process.

Because this is something that has broken more than once, we have two
workarounds in mke2fs; one is that we can call fsync(2) every N block
groups' worth of inode tables, which is kind of a hack, and the other
is that we can use Direct I/O.  But using DIO has a worse user
experience (well, unless the alternative is mke2fs getting OOM-killed;
admittedly that's worse) than just using buffered I/O, since we
generally don't need to synchronously wait for the write requests to
complete.  Neither is enabled by default, because in my view, this is
something the mm should just get right, darn it.
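
To make the first workaround concrete, here is a minimal sketch of the
flush-every-N-chunks pattern while zeroing with buffered writes; the chunk
size, the flush interval and the zero_region() helper name are made up for
illustration and are not mke2fs's actual values:

  #include <unistd.h>

  #define CHUNK           (1024 * 1024)   /* made-up: write 1 MiB at a time */
  #define FSYNC_EVERY     256             /* made-up: fsync every 256 MiB */

  /*
   * Zero out a large region with buffered writes, periodically calling
   * fsync() so the kernel never accumulates an unbounded pile of dirty
   * pages from this one writer.
   */
  static int zero_region(int fd, off_t start, off_t len)
  {
          static const char zeroes[CHUNK];
          off_t done = 0;
          unsigned long chunks = 0;

          while (done < len) {
                  size_t n = (len - done) < CHUNK ? (size_t)(len - done) : CHUNK;

                  if (pwrite(fd, zeroes, n, start + done) != (ssize_t)n)
                          return -1;
                  done += n;
                  if (++chunks % FSYNC_EVERY == 0 && fsync(fd) < 0)
                          return -1;
          }
          return fsync(fd);
  }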

In any case, I definitely don't consider write throttling to be a
performance "problem" --- it's actually a far worse problem when the
throttling doesn't happen, because it generally means someone is
getting OOM-killed.

						- Ted


* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-24 19:11       ` Linus Torvalds
  2024-02-24 21:42         ` Theodore Ts'o
@ 2024-02-24 22:57         ` Chris Mason
  2024-02-24 23:40           ` Linus Torvalds
  2024-05-10 23:57           ` Luis Chamberlain
  1 sibling, 2 replies; 87+ messages in thread
From: Chris Mason @ 2024-02-24 22:57 UTC (permalink / raw)
  To: Linus Torvalds, Matthew Wilcox
  Cc: Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez,
	Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig,
	Chris Mason, Johannes Weiner

On 2/24/24 2:11 PM, Linus Torvalds wrote:
> On Sat, 24 Feb 2024 at 10:20, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> If somebody really cares about this kind of load, and cannot use
>> O_DIRECT for some reason ("I actually do want caches 99% of the
>> time"), I suspect the solution is to have some slightly gentler way to
>> say "instead of the throttling logic, I want you to start my writeouts
>> much more synchronously".
>>
>> IOW, we could have a writer flag that still uses the page cache, but
>> that instead of that
>>
>>                 balance_dirty_pages_ratelimited(mapping);
> 
> I was *sure* we had had some work in this area, and yup, there's a
> series from 2019 by Konstantin Khlebnikov to implement write-behind.
> 
> Some digging in the lore archives found this
> 
>     https://lore.kernel.org/lkml/156896493723.4334.13340481207144634918.stgit@buzz/
> 
> but I don't remember what then happened to it.  It clearly never went
> anywhere, although I think something _like_ that is quite possibly the
> right thing to do (and I was fairly positive about the patch at the
> time).
> 
> I have this feeling that there's been other attempts of write-behind
> in this area, but that thread was the only one I found from my quick
> search.
> 
> I'm not saying Konstanti's patch is the thing to do, and I suspect we
> might want to actually have some way for people to say at open-time
> that "I want write-behind", but it looks like at least a starting
> point.
> 
> But it is possible that this work never went anywhere exactly because
> this is such a rare case. That kind of "write so much that you want to
> do something special" is often such a special thing that using
> O_DIRECT is generally the trivial solution.

For teams that really want more control over dirty pages with existing APIs,
I've suggested using sync_file_range periodically.  It seems to work
pretty well, and they can adjust the sizes and frequency as needed.
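
Roughly the pattern I mean; the 64 MiB window and the write_throttled()
helper name are arbitrary, just to show the shape of it:

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <unistd.h>

  #define WINDOW  (64UL << 20)    /* arbitrary: flush in 64 MiB windows */

  /*
   * Buffered write loop that starts async writeback on each chunk as it
   * is written, and waits for the window before it, so dirty pages never
   * pile up to the global dirty limits.
   */
  static ssize_t write_throttled(int fd, const char *buf, size_t len, off_t off)
  {
          size_t done = 0;

          while (done < len) {
                  size_t n = (len - done) < WINDOW ? (len - done) : WINDOW;
                  ssize_t ret = pwrite(fd, buf + done, n, off + done);

                  if (ret <= 0)
                          return ret;
                  done += ret;

                  /* Kick off writeback for what we just wrote... */
                  sync_file_range(fd, off + done - ret, ret,
                                  SYNC_FILE_RANGE_WRITE);
                  /* ...and wait for the window before it to hit storage. */
                  if (done > WINDOW)
                          sync_file_range(fd, off + done - ret - WINDOW, WINDOW,
                                          SYNC_FILE_RANGE_WAIT_BEFORE |
                                          SYNC_FILE_RANGE_WRITE |
                                          SYNC_FILE_RANGE_WAIT_AFTER);
          }
          return done;
  }

The window size is the tunable: bigger windows behave more like plain
buffered IO, smaller ones bound how much dirty data one writer can pile up.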

Managing clean pages has been a problem with workloads that really care
about p99 allocation latency.  We've had issues where kswapd saturates a
core throwing away all the clean pages from either streaming readers or
writers.

To reproduce on 6.8-rc5, I did buffered IO onto a 6 drive raid0 via MD.
Max possible tput seems to be 8GB/s writes, and the box has 256GB of ram
across two sockets.  For buffered IO onto md0, we're hitting about
1.2GB/s, and have a core saturated by a kworker doing writepages.

From time to time, our random crud that maintains the system will need a
lot of memory and kswapd will saturate a core, but this tends to resolve
itself after 10-20 seconds.  Our ultra sensitive workloads would
complain, but they manage the page cache more explicitly to avoid these
situations.

The raid0 is fast enough that we never hit the synchronous dirty page
limit.  fio is just 100% CPU bound, and when kswapd saturates a core,
it's just freeing clean pages.

With filesystems in use, kswapd and the writepages kworkers are better
behaved, which just makes me think writepages on blockdevices have seen
less optimization, not really a huge surprise.  Filesystems can push the
full 8GB/s tput either buffered or O_DIRECT.

With streaming writes to a small number of large files, total free
memory might get down to 1.5GB on the 256GB machine, with most of the
rest being clean page cache.

If I instead write to millions of 1MB files, free memory refuses to go
below 12GB, and kswapd doesn't misbehave at all.  We're still pushing
7GB/s writes.

Not a lot of conclusions, other than it's not that hard to use clean
page cache to make the system slower than some workloads are willing to
tolerate.

Ignoring wildly slow devices, the dirty limits seem to work well enough
on both big and small systems that I haven't needed to investigate
issues there as often.

Going back to Luis's original email, I'd echo Willy's suggestion for
profiles.  Unless we're saturating memory bandwidth, buffered should be
able to get much closer to O_DIRECT, just at a much higher overall cost.

-chris



* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-24 22:57         ` Chris Mason
@ 2024-02-24 23:40           ` Linus Torvalds
  2024-05-10 23:57           ` Luis Chamberlain
  1 sibling, 0 replies; 87+ messages in thread
From: Linus Torvalds @ 2024-02-24 23:40 UTC (permalink / raw)
  To: Chris Mason
  Cc: Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Sat, 24 Feb 2024 at 14:58, Chris Mason <clm@meta.com> wrote:
>
> For teams that really more control over dirty pages with existing APIs,
> I've suggested using sync_file_range periodically.  It seems to work
> pretty well, and they can adjust the sizes and frequency as needed.

Yes. I've written code like that myself.

That said, that is also fairly close to what the write-behind patches
I pointed at did.

One issue (and maybe that was what killed that write-behind patch) is
that there are *other* benchmarks that are actually slightly more
realistic that do things like "untar a tar-file, do something with it,
and then 'rm -rf' it all again".

And *those* benchmarks behave best when the IO is never ever actually
done at all. And unlike the "write a terabyte with random IO", those
benchmarks actually approximate a few somewhat real loads (I'm not
claiming they are good, but the "create files, do something, then
remove them" pattern at least _exists_ in real life).

For things like block device write for a 'mkfs' run, the whole "this
file may be deleted soon, so let's not even start the write in the
first place" behavior doesn't exist, of course. Starting writeback
much more aggressively for those is probably not a bad idea.

> From time to time, our random crud that maintains the system will need a
> lot of memory and kswapd will saturate a core, but this tends to resolve
> itself after 10-20 seconds.  Our ultra sensitive workloads would
> complain, but they manage the page cache more explicitly to avoid these
> situations.

You can see these things with slow USB devices with much more obvious
results. Including long spikes of total inactivity if some system
piece ends up doing a "sync" for some reason. It happens. It's very
annoying.

My gut feel is that it happens a lot less these days than it used to,
but I suspect that's at least partly because I don't see the slow USB
devices very much any more.

> Ignoring wildly slow devices, the dirty limits seem to work well enough
> on both big and small systems that I haven't needed to investigate
> issues there as often.

One particular problem point used to be backing devices with wildly
different IO throughput, because I think the speed heuristics don't
necessarily always work all that well at least initially.

And things like that may partly explain your "filesystems work better
than block devices".  It doesn't necessarily have to be about
filesystems vs block devices per se, and be instead about things like
"on a filesystem, the bdi throughput numbers have had time to
stabilize".

In contrast, with a benchmark that uses some other random device that
doesn't look like a regular disk (whether it's really slow like a bad
USB device, or really fast like pmem), you might see more issues. And
I wouldn't be in the least surprised if that is part of the situation
Luis sees.

              Linus


* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-24 17:31   ` Linus Torvalds
  2024-02-24 18:13     ` Matthew Wilcox
  2024-02-24 18:20     ` Linus Torvalds
@ 2024-02-25  5:18     ` Kent Overstreet
  2024-02-25  6:04       ` Kent Overstreet
  2024-02-25 13:10       ` Matthew Wilcox
  2 siblings, 2 replies; 87+ messages in thread
From: Kent Overstreet @ 2024-02-25  5:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Sat, Feb 24, 2024 at 09:31:44AM -0800, Linus Torvalds wrote:
> On Fri, 23 Feb 2024 at 20:12, Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> > >  What are the limits to buffered IO
> > > and how do we test that? Who keeps track of it?
> >
> > TLDR: Why does the pagecache suck?
> 
> What? No.
> 
> Our page cache is so good that the question is literally "what are the
> limits of it", and "how we would measure them".
> 
> That's not a sign of suckage.
> 
> When you have to have completely unrealistic loads that nobody would
> actually care about in reality just to get a number for the limit,
> it's not a sign of problems.
> 
> Or rather, the "problem" is the person looking at a stupid load, and
> going "we need to improve this because I can write a benchmark for
> this".
> 
> Here's a clue: a hardware discussion forum I visit was arguing about
> memory latencies, and talking about how their measured overhead of
> DRAM latency was literally 85% on the CPU side, not the DRAM side.
> 
> Guess what? It's because the CPU in question had quite a bit of L3,
> and it was spread out, and the CPU doesn't even start the memory
> access before it has checked caches.
> 
> And here's a big honking clue: only a complete nincompoop and mentally
> deficient rodent would look at that and say "caches suck".
> 
> > >  ~86 GiB/s on pmem DIO on xfs with 64k block size, 1024 XFS agcount on x86_64
> > >      Vs
> > >  ~ 7,000 MiB/s with buffered IO
> >
> > Profile?  My guess is that you're bottlenecked on the xa_lock between
> > memory reclaim removing folios from the page cache and the various
> > threads adding folios to the page cache.
> 
> I doubt it's the locking.
> 
> In fact, for writeout in particular it's probably not even the page
> cache at all.
> 
> For writeout, we have a very traditional problem: we care about a
> million times more about latency than we care about throughput,
> because nobody ever actually cares all that much about performance of
> huge writes.

Before large folios, we had people very much bottlenecked by 4k page
overhead on sequential IO; my customer/sponsor was one of them.

Factor of 2 or 3, IIRC; it was _bad_. And when you looked at the
profiles and looked at the filemap.c code it wasn't hard to see why;
we'd walk a radix tree, do an atomic op (get the page), then do a 4k
usercopy... hence the work I did to break up
generic_file_buffered_read() and vectorize it, which was a huge
improvement.

It's definitely less of a factor post large folios and when we're
talking about workloads that don't fit in cache, but I always wanted to
do a generic version of the vectorized write path that btrfs and bcachefs
have.


* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-23 23:59 [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO Luis Chamberlain
  2024-02-24  4:12 ` Matthew Wilcox
@ 2024-02-25  5:24 ` Kent Overstreet
  2024-02-26 12:22 ` Dave Chinner
  2024-02-27 10:07 ` Kent Overstreet
  3 siblings, 0 replies; 87+ messages in thread
From: Kent Overstreet @ 2024-02-25  5:24 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner, Matthew Wilcox, Linus Torvalds

On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> Part of the testing we have done with LBS was to do some performance
> tests on XFS to ensure things are not regressing. Building linux is a
> fine decent test and we did some random cloud instance tests on that and
> presented that at Plumbers, but it doesn't really cut it if we want to
> push things to the limit though. What are the limits to buffered IO
> and how do we test that? Who keeps track of it?
> 
> The obvious recurring tension is that for really high performance folks
> just recommend to use direct IO. But if you are stress testing changes
> to a filesystem and want to push buffered IO to its limits it makes
> sense to stick to buffered IO, otherwise how else do we test it?
> 
> It is good to know limits to buffered IO too because some workloads
> cannot use direct IO.  For instance PostgreSQL doesn't have direct IO
> support and even as late as the end of last year we learned that adding
> direct IO to PostgreSQL would be difficult.  Chris Mason has noted also
> that direct IO can also force writes during reads (?)... Anyway, testing
> the limits of buffered IO limits to ensure you are not creating
> regressions when doing some page cache surgery seems like it might be
> useful and a sensible thing to do .... The good news is we have not found
> regressions with LBS but all the testing seems to beg the question, of what
> are the limits of buffered IO anyway, and how does it scale? Do we know, do
> we care? Do we keep track of it? How does it compare to direct IO for some
> workloads? How big is the delta? How do we best test that? How do we
> automate all that? Do we want to automatically test this to avoid regressions?
> 
> The obvious issues with some workloads for buffered IO is having a
> possible penalty if you are not really re-using folios added to the
> page cache. Jens Axboe reported a while ago issues with workloads with
> random reads over a data set 10x the size of RAM and also proposed
> RWF_UNCACHED as a way to help [0]. As Chinner put it, this seemed more
> like direct IO with kernel pages and a memcpy(), and it requires
> further serialization to be implemented that we already do for
> direct IO for writes. There at least seems to be agreement that if we're
> going to provide an enhancement or alternative that we should strive to not
> make the same mistakes we've done with direct IO. The rationale for some
> workloads to use buffered IO is it helps reduce some tail latencies, so
> that's something to live up to.
> 
> On that same thread Christoph also mentioned the possibility of a direct
> IO variant which can leverage the cache. Is that something we want to
> move forward with?

The thing to consider here would be an improved O_SYNC. There's a fair
amount of tree walking and thread to thread cacheline bouncing that
would be avoided by just calling .write_folios() and kicking bios off
from .write_iter().

OTOH - the way it's done now is probably the best possible way of
splitting up the work between multiple threads, so I'd expect this
approach to get less throughput than current O_SYNC.

Luis, are you profiling these workloads? I haven't looked at high
throughput profiles of the buffered IO path in years, and that's a good
place to start.


* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-25  5:18     ` Kent Overstreet
@ 2024-02-25  6:04       ` Kent Overstreet
  2024-02-25 13:10       ` Matthew Wilcox
  1 sibling, 0 replies; 87+ messages in thread
From: Kent Overstreet @ 2024-02-25  6:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Sun, Feb 25, 2024 at 12:18:23AM -0500, Kent Overstreet wrote:
> On Sat, Feb 24, 2024 at 09:31:44AM -0800, Linus Torvalds wrote:
> Before large folios, we had people very much bottlenecked by 4k page
> overhead on sequential IO; my customer/sponsor was one of them.
> 
> Factor of 2 or 3, IIRC; it was _bad_. And when you looked at the
> profiles and looked at the filemap.c code it wasn't hard to see why;
> we'd walk a radix tree, do an atomic op (get the page), then do a 4k
> usercopy... hence the work I did to break up
> generic_file_buffered_read() and vectorize it, which was a huge
> improvement.
> 
> It's definitely less of a factor when post large folios and when we're
> talking about workloads that don't fit in cache, but I always wanted to
> do a generic version of the vectorized write path that btrfs and bcachefs
> have.

to expound further, our buffered io performance really is crap vs.
direct in lots of real world scenarios, and what was going on in
generic_file_buffered_read() was just one instance of a larger theme -
walking data structures, taking locks/atomics/barriers, then doing work
on the page/folio with cacheline bounces, in a loop - lots of places
where batching/vectorizing would help a lot but it tends to be
insufficient.

i had patches that went further than the generic_file_buffered_read()
rework to vectorize add_to_page_cache_lru(), and that was another
significant improvement.

the pagecache lru operations were another hot spot... willy and I at one
point were spitballing getting rid of the linked list for a dequeue,
more for getting rid of the list_head in struct page/folio and replacing
it with a single size_t index, but it'd open up more vectorizing
possibilities

i give willy crap about the .readahead interface... the way we make the
filesystem code walk the xarray to get the folios instead of just
passing it a vector is stupid

folio_batch is stupid, it shouldn't be fixed size. there's no reason for
that to be a small fixed size array on the stack, the slub fastpath has
no atomic ops and doesn't disable preemption or interrupts - it's
_fast_. just use a darray and vectorize the whole operation

but that wouldn't be the big gains, bigger would be hunting down all the
places that aren't vectorized and should be.

i haven't reviewed the recent .writepages work christoph et all are
doing, if that's properly vectorized now that'll help


* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-25  5:18     ` Kent Overstreet
  2024-02-25  6:04       ` Kent Overstreet
@ 2024-02-25 13:10       ` Matthew Wilcox
  2024-02-25 17:03         ` Linus Torvalds
  2024-02-25 17:32         ` Kent Overstreet
  1 sibling, 2 replies; 87+ messages in thread
From: Matthew Wilcox @ 2024-02-25 13:10 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Linus Torvalds, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Sun, Feb 25, 2024 at 12:18:23AM -0500, Kent Overstreet wrote:
> Before large folios, we had people very much bottlenecked by 4k page
> overhead on sequential IO; my customer/sponsor was one of them.
> 
> Factor of 2 or 3, IIRC; it was _bad_. And when you looked at the
> profiles and looked at the filemap.c code it wasn't hard to see why;
> we'd walk a radix tree, do an atomic op (get the page), then do a 4k
> usercopy... hence the work I did to break up
> generic_file_buffered_read() and vectorize it, which was a huge
> improvement.

There's also the small random 64 byte read case that we haven't optimised
for yet.  That also bottlenecks on the page refcount atomic op.

The proposed solution to that was double-copy; look up the page without
bumping its refcount, copy to a buffer, look up the page again to be
sure it's still there, copy from the buffer to userspace.

Except that can go wrong under really unlikely circumstances.  Look up the
page, page gets freed, page gets reallocated to slab, we copy sensitive
data from it, page gets freed again, page gets reallocated to the same
spot in the file (!), lookup says "yup the same page is there".
We'd need a seqcount or something to be sure the page hasn't moved.


* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-25 13:10       ` Matthew Wilcox
@ 2024-02-25 17:03         ` Linus Torvalds
  2024-02-25 21:14           ` Matthew Wilcox
  2024-02-25 21:29           ` Kent Overstreet
  2024-02-25 17:32         ` Kent Overstreet
  1 sibling, 2 replies; 87+ messages in thread
From: Linus Torvalds @ 2024-02-25 17:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kent Overstreet, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Sun, 25 Feb 2024 at 05:10, Matthew Wilcox <willy@infradead.org> wrote:
>
> There's also the small random 64 byte read case that we haven't optimised
> for yet.  That also bottlenecks on the page refcount atomic op.
>
> The proposed solution to that was double-copy; look up the page without
> bumping its refcount, copy to a buffer, look up the page again to be
> sure it's still there, copy from the buffer to userspace.

Please stop the cray-cray.

Yes, cache dirtying is expensive. But you don't actually have
cacheline ping-pong, because you don't have lots of different CPU's
hammering the same page cache page in any normal circumstances. So the
really expensive stuff just doesn't exist.

I think you've been staring at profiles too much. In instruction-level
profiles, the atomic ops stand out a lot. But that's at least partly
artificial - they are a serialization point on x86, so things get
accounted to them. So they tend to be the collection point for
everything around them in an OoO CPU.

Yes, atomics are bad. But double buffering is worse, and only looks
good if you have some artificial benchmark that does some single-byte
hot-cache read in a loop.

In fact, I get the strong feeling that the complaints come from people
who have looked at bad microbenchmarks a bit too much. People who have
artificially removed the *real* costs by putting their data on a
ramdisk, and then run a microbenchmark on this artificial setup.

So you have a make-believe benchmark on a make-believe platform, and
you may have started out with the best of intentions ("what are the
limits"), but at some point you took a wrong turn, and turned that
"what are the limits of performance" question into an
instruction-level profile and tried to mis-optimize the limits,
instead of realizing that that is NOT THE POINT of a "what are the
limits" question.

The point of doing limit analysis is not to optimize the limit. It's
to see how close you are to that limit in real loads.

And I pretty much guarantee that you aren't close to those limits on
any real loads.

Before filesystem people start doing crazy things like double
buffering to do RCU reading of the page cache, you need to look
yourself in the mirror.

For example, the fact that Kent complains about the page cache and
talks about large folios is completely ludicrous. I've seen the
benchmarks of real loads. Kent - you're not close to any limits, you
are often a factor of two to five off other filesystems. We're not
talking "a few percent", and we're not talking "the atomics are
hurting".

So people: wake up and smell the coffee.  Don't optimize based off
profiles of micro-benchmarks on made up platforms. That's for seeing
where the limits are. And YOU ARE NOT EVEN CLOSE.

                  Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-25 13:10       ` Matthew Wilcox
  2024-02-25 17:03         ` Linus Torvalds
@ 2024-02-25 17:32         ` Kent Overstreet
  1 sibling, 0 replies; 87+ messages in thread
From: Kent Overstreet @ 2024-02-25 17:32 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Sun, Feb 25, 2024 at 01:10:39PM +0000, Matthew Wilcox wrote:
> On Sun, Feb 25, 2024 at 12:18:23AM -0500, Kent Overstreet wrote:
> > Before large folios, we had people very much bottlenecked by 4k page
> > overhead on sequential IO; my customer/sponsor was one of them.
> > 
> > Factor of 2 or 3, IIRC; it was _bad_. And when you looked at the
> > profiles and looked at the filemap.c code it wasn't hard to see why;
> > we'd walk a radix tree, do an atomic op (get the page), then do a 4k
> > usercopy... hence the work I did to break up
> > generic_file_buffered_read() and vectorize it, which was a huge
> > improvement.
> 
> There's also the small random 64 byte read case that we haven't optimised
> for yet.  That also bottlenecks on the page refcount atomic op.
> 
> The proposed solution to that was double-copy; look up the page without
> bumping its refcount, copy to a buffer, look up the page again to be
> sure it's still there, copy from the buffer to userspace.
> 
> Except that can go wrong under really unlikely circumstances.  Look up the
> page, page gets freed, page gets reallocated to slab, we copy sensitive
> data from it, page gets freed again, page gets reallocated to the same
> spot in the file (!), lookup says "yup the same page is there".
> We'd need a seqcount or something to be sure the page hasn't moved.

yes, generation numbers are the standard solution to ABA...

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-25 17:03         ` Linus Torvalds
@ 2024-02-25 21:14           ` Matthew Wilcox
  2024-02-25 23:45             ` Linus Torvalds
  2024-02-25 21:29           ` Kent Overstreet
  1 sibling, 1 reply; 87+ messages in thread
From: Matthew Wilcox @ 2024-02-25 21:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kent Overstreet, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Sun, Feb 25, 2024 at 09:03:32AM -0800, Linus Torvalds wrote:
> I think you've been staring at profiles too much. In instruction-level
> profiles, the atomic ops stand out a lot. But that's at least partly
> artificial - they are a serialization point on x86, so things get
> accounted to them. So they tend to be the collection point for
> everything around them in an OoO CPU.
> 
> Yes, atomics are bad. But double buffering is worse, and only looks
> good if you have some artificial benchmark that does some single-byte
> hot-cache read in a loop.

Not artificial; this was a real customer with a real workload.  I don't
know how much about it I can discuss publically, but my memory of it was a
system writing a log with 64 byte entries, millions of entries per second.
Occasionally the system would have to go back and look at an entry in the
last few seconds worth of data (so it would still be in the page cache).

This customer was quite savvy, so they actually implemented and tested
the lookup-copy-lookup-again algorithm in their custom kernel, and saw
a speedup from it.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-25 17:03         ` Linus Torvalds
  2024-02-25 21:14           ` Matthew Wilcox
@ 2024-02-25 21:29           ` Kent Overstreet
  1 sibling, 0 replies; 87+ messages in thread
From: Kent Overstreet @ 2024-02-25 21:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Sun, Feb 25, 2024 at 09:03:32AM -0800, Linus Torvalds wrote:
> On Sun, 25 Feb 2024 at 05:10, Matthew Wilcox <willy@infradead.org> wrote:
> >
> > There's also the small random 64 byte read case that we haven't optimised
> > for yet.  That also bottlenecks on the page refcount atomic op.
> >
> > The proposed solution to that was double-copy; look up the page without
> > bumping its refcount, copy to a buffer, look up the page again to be
> > sure it's still there, copy from the buffer to userspace.
> 
> Please stop the cray-cray.
> 
> Yes, cache dirtying is expensive. But you don't actually have
> cacheline ping-pong, because you don't have lots of different CPU's
> hammering the same page cache page in any normal circumstances. So the
> really expensive stuff just doesn't exist.

Not ping pong; you're just blowing the cachelines you want out of L1
with the big usercopy, hardware caches not being fully associative.

> I think you've been staring at profiles too much. In instruction-level
> profiles, the atomic ops stand out a lot. But that's at least partly
> artificial - they are a serialization point on x86, so things get
> accounted to them. So they tend to be the collection point for
> everything around them in an OoO CPU.

Yes, which leads to a fun game of whack a mole when you eliminate one
atomic op and then everything just ends up piling up behind a different
atomic op - but for the buffered read path, the folio get/put are the
only atomic ops.

> For example, the fact that Kent complains about the page cache and
> talks about large folios is completely ludicrous. I've seen the
> benchmarks of real loads. Kent - you're not close to any limits, you
> are often a factor of two to five off other filesystems. We're not
> talking "a few percent", and we're not talking "the atomics are
> hurting".

Yes, there's a bunch of places where bcachefs is still slow; it'll get
there :)

If you've got those benchmarks handy and they're ones I haven't seen,
I'd love to take a look; the one that always jumps out at people is small
O_DIRECT reads, and that hasn't been a priority because O_DIRECT doesn't
matter to most people nearly as much as they think it does.

There's a bunch of stuff still to work through; another that comes to
mind is that we need a free inodes btree to eliminate scanning in inode
create, and that was half a day of work - except it also needs sharding
(i.e. leaf nodes can't span certain boundaries), and for that I need
variable sized btree nodes so we aren't burning stupid amounts of
memory - and that's something we need anyways, number of btrees growing
like it is.

Another fun one that I just discovered while I was hanging out at
Darrick's - journal was stalling on high iodepth workloads; device write
buffer fills up, write latency goes up, suddenly the journal can't write
quickly enough when it's only submitting one write at a time. So there's
a fix for 6.9 queued up that lets the journal keep multiple writes in
flight.

That one was worth mentioning because another fix would've been to add a
way to signal backpressure to /above/ the filesystem, so that we don't
hit such big queuing delays within the filesystem; right now user writes
don't hit backpressure until submit_bio() blocks because the request
queue is full. I've been seeing other performance corner cases where it
looks like such a mechanism would be helpful.

I expect I've got a solid year or two ahead of me of mostly just working
through performance bugs - standing up a lot of automated perf testing
and whatnot. But, one thing at a time...

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-25 21:14           ` Matthew Wilcox
@ 2024-02-25 23:45             ` Linus Torvalds
  2024-02-26  1:02               ` Kent Overstreet
  0 siblings, 1 reply; 87+ messages in thread
From: Linus Torvalds @ 2024-02-25 23:45 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kent Overstreet, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Sun, 25 Feb 2024 at 13:14, Matthew Wilcox <willy@infradead.org> wrote:
>
> Not artificial; this was a real customer with a real workload.  I don't
> know how much about it I can discuss publically, but my memory of it was a
> system writing a log with 64 byte entries, millions of entries per second.
> Occasionally the system would have to go back and look at an entry in the
> last few seconds worth of data (so it would still be in the page cache).

Honestly, that should never hit any kind of contention on the page cache.

Unless they did something else odd, that load should be entirely
serialized by the POSIX "atomic write" requirements and the
"inode_lock(inode)"  that writes take.

So it would end up literally being just one cache miss - and if you do
things across CPU's and have cachelines moving around, that inode lock
would be the bigger offender in that it is the one that would see any
contention.

Now, *that* is locking that I despise, much more than the page cache
lock.  It serializes unrelated writes to different areas, and the
direct-IO people instead said "we don't care about POSIX" and did
concurrent writes without it.

That said, I do wonder if we could take advantage of the fact that we
have the inode lock, and just make page eviction take that lock too
(possibly in shared form).

At that point, you really could just say "no need to increment the
reference count, because we can do writes knowing that the mapping
pages are stable".

Not pretty, but we could possibly at least take advantage of the
horrid other ugliness of the inode locking and POSIX rules that nobody
really wants.

              Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-25 23:45             ` Linus Torvalds
@ 2024-02-26  1:02               ` Kent Overstreet
  2024-02-26  1:32                 ` Linus Torvalds
  0 siblings, 1 reply; 87+ messages in thread
From: Kent Overstreet @ 2024-02-26  1:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Sun, Feb 25, 2024 at 03:45:47PM -0800, Linus Torvalds wrote:
> On Sun, 25 Feb 2024 at 13:14, Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Not artificial; this was a real customer with a real workload.  I don't
> > know how much about it I can discuss publically, but my memory of it was a
> > system writing a log with 64 byte entries, millions of entries per second.
> > Occasionally the system would have to go back and look at an entry in the
> > last few seconds worth of data (so it would still be in the page cache).
> 
> Honestly, that should never hit any kind of contention on the page cache.
> 
> Unless they did something else odd, that load should be entirely
> serialized by the POSIX "atomic write" requirements and the
> "inode_lock(inode)"  that writes take.
> 
> So it would end up literally being just one cache miss - and if you do
> things across CPU's and have cachelines moving around, that inode lock
> would be the bigger offender in that it is the one that would see any
> contention.
> 
> Now, *that* is locking that I despise, much more than the page cache
> lock.  It serializes unrelated writes to different areas, and the
> direct-IO people instead said "we don't care about POSIX" and did
> concurrent writes without it.

We could satisfy the posix atomic writes rule by just having a properly
vectorized buffered write path, no need for the inode lock - it really
should just be extending writes that have to hit the inode lock, same as
O_DIRECT.

(whenever people bring up range locks, I keep trying to tell them - we
already have that in the form of the folio lock, if you'd just use it
properly...)
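
As a sketch of where that line would sit (this helper doesn't exist,
it's just the check I mean):

/*
 * Sketch only: decide whether a buffered write can rely on per-folio
 * locking, or must serialize on the inode lock the way extending
 * O_DIRECT writes already do.
 */
static bool buffered_write_needs_inode_lock(struct kiocb *iocb,
					    struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);

	if (iocb->ki_flags & IOCB_APPEND)
		return true;
	/* extending the file changes i_size, so that still serializes */
	return iocb->ki_pos + iov_iter_count(from) > i_size_read(inode);
}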

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26  1:02               ` Kent Overstreet
@ 2024-02-26  1:32                 ` Linus Torvalds
  2024-02-26  1:58                   ` Kent Overstreet
  2024-02-26  2:50                   ` Al Viro
  0 siblings, 2 replies; 87+ messages in thread
From: Linus Torvalds @ 2024-02-26  1:32 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Sun, 25 Feb 2024 at 17:03, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> We could satisfy the posix atomic writes rule by just having a properly
> vectorized buffered write path, no need for the inode lock - it really
> should just be extending writes that have to hit the inode lock, same as
> O_DIRECT.
>
> (whenever people bring up range locks, I keep trying to tell them - we
> already have that in the form of the folio lock, if you'd just use it
> properly...)

Sadly, that is *technically* not proper.

IOW, I actually agree with you that the folio lock is sufficient, and
several filesystems do too.

BUT.

Technically, the POSIX requirements are that the atomicity of writes
are "all or nothing" being visible, and so ext4, for example, will
have the whole write operation inside the inode_lock.

Note that this is not some kind of locking requirement coming from
some ext4 locking rule: it only does this for the buffered path, the
DIO path happily says "if I'm not extending the size of the inode,
I'll just take the lock for shared access".

So this is literally only about buffered writes, and the atomicity
within a folio is already dealt with with the folio lock (ext4 uses
"generic_perform_write()" to do the actual writing part, which does
*not* care about that lock).

IOW, the whole inode lock is pointless for any real uses, but exists
because _technically_ POSIX does not allow reads to see partial writes
(ie if you see *one* page change, you have to see all of them change,
so per-folio locking is not sufficient - you technically need to lock
the whole operations against readers too, not just writers).

Of course, in real life absolutely nobody cares, and you can see
partial writes in many other ways, so this is really a completely
"because of paper standards" serialization.

Several other filesystems do *not* do that serialization, and as
mentioned, the DIO paths also just said "hey, this isn't a normal
write, so we'll ignore POSIX because technically we can".

But to take a different example, ext2 just calls
generic_file_write_iter() *without* taking the inode lock, and does
locking one page at a time. As far as I know, nobody ever really
complained.

(It's not just ext2. It's all the old filesystems: anything that uses
generic_file_write_iter() without doing inode_lock/unlock around it,
which is actually most of them).

And I find that silly locking rule a bit offensive, because it
literally serializes accesses that shouldn't be serialized.

IOW, it *should* be a "you do stupid things, you get what you
deserve". Instead, it's a "everybody pays the price for giving stupid
things a pointless semantic guarantee".

Afaik, it has a purely historical reason for it - all the original
UNIX read/write calls used a per-inode lock, so they were all
serialized whether you liked it or not.

Oh well. It's a pet peeve of mine.

              Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26  1:32                 ` Linus Torvalds
@ 2024-02-26  1:58                   ` Kent Overstreet
  2024-02-26  2:06                     ` Kent Overstreet
  2024-02-26  2:34                     ` Linus Torvalds
  2024-02-26  2:50                   ` Al Viro
  1 sibling, 2 replies; 87+ messages in thread
From: Kent Overstreet @ 2024-02-26  1:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Sun, Feb 25, 2024 at 05:32:14PM -0800, Linus Torvalds wrote:
> On Sun, 25 Feb 2024 at 17:03, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > We could satisfy the posix atomic writes rule by just having a properly
> > vectorized buffered write path, no need for the inode lock - it really
> > should just be extending writes that have to hit the inode lock, same as
> > O_DIRECT.
> >
> > (whenever people bring up range locks, I keep trying to tell them - we
> > already have that in the form of the folio lock, if you'd just use it
> > properly...)
> 
> Sadly, that is *technically* not proper.
> 
> IOW, I actually agree with you that the folio lock is sufficient, and
> several filesystems do too.
> 
> BUT.
> 
> Technically, the POSIX requirements are that the atomicity of writes
> are "all or nothing" being visible, and so ext4, for example, will
> have the whole write operation inside the inode_lock.

...
 
> (It's not just ext2. It's all the old filesystems: anything that uses
> generic_file_write_iter() without doing inode_lock/unlock around it,
> which is actually most of them).

According to my reading just now, ext4 and btrfs (as well as bcachefs)
also don't take the inode lock in the read path - xfs is the only one
that does.

Perhaps we should just lift it to the VFS and make it controllable as a
mount/open option, as nice of a property as it is in theory I can't see
myself wanting to make everyone pay for it if ext4 and btrfs aren't
doing it and no one's screaming.

I think write vs. write consistency is the more interesting case; the
question there is whether falling back to the inode lock when we can't
lock all the folios simultaneously works.

Consider thread A doing a 1 MB write, and it ends up in the path where
it locks the inode and it's allowed to write one folio at a time.

Then you have thread B doing some form of overlapping write, but without
the inode lock, and with all the folios locked simultaneously.

I think everything works; we need the end result to be consistent with
some total ordering of all the writes, IOW, thread B's write (if fully
within thread A's write) should be fully overwritten or not at all, and
that clearly is the case. But there may be situations involving more
than two threads where things get weirder.
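
Something like this for the "lock them all or fall back" part -
folio_trylock() is real, everything around it (getting the folios,
the caller falling back to inode_lock()) is hand-waved:

/*
 * Sketch, not a proposal: try to lock every folio covering the write
 * range without blocking; on any failure, unlock what we got so the
 * caller can fall back to inode_lock().  Folio lookup/creation and
 * ordering against truncate are glossed over entirely.
 */
static bool trylock_write_range(struct folio **folios, unsigned int nr)
{
	unsigned int i;

	for (i = 0; i < nr; i++) {
		if (!folio_trylock(folios[i])) {
			while (i--)
				folio_unlock(folios[i]);
			return false;
		}
	}
	return true;
}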

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26  1:58                   ` Kent Overstreet
@ 2024-02-26  2:06                     ` Kent Overstreet
  2024-02-26  2:34                     ` Linus Torvalds
  1 sibling, 0 replies; 87+ messages in thread
From: Kent Overstreet @ 2024-02-26  2:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner,
	Suren Baghdasaryan

On Sun, Feb 25, 2024 at 08:58:28PM -0500, Kent Overstreet wrote:
> I think everything works; we need the end result to be consistent with
> some total ordering of all the writes, IOW, thread B's write (if fully
> within thread A's write) should be fully overwritten or not at all, and
> that clearly is the case. But there may be situations involving more
> than two threads where things get weirder.

Not sure if there's anything interesting or deep here, but these
situations where sometimes we've got a global lock and sometimes we've
got little locks do seem to come up - it reminded me of Suren's work on
the mmap sem recently, and I think I've seen it in other places. Feels
like it should at least have a name...

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26  1:58                   ` Kent Overstreet
  2024-02-26  2:06                     ` Kent Overstreet
@ 2024-02-26  2:34                     ` Linus Torvalds
  1 sibling, 0 replies; 87+ messages in thread
From: Linus Torvalds @ 2024-02-26  2:34 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Sun, 25 Feb 2024 at 17:58, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> According to my reading just now, ext4 and btrfs (as well as bcachefs)
> also don't take the inode lock in the read path - xfs is the only one
> that does.

Yeah, I should have remembered that detail - DaveC has pointed it out
at some point how other filesystems don't actually honor the whole
"all or nothing visible to read".

And I was actually wrong about the common cases like ext2 - they use
generic_file_write_iter(), which does take that inode lock, and I was
confused with generic_perform_write() (which does not).

It was always the read side that didn't care, as you point out. It's
been some time since I looked at that.

But as mentioned, nobody has actually ever shown any real interest in
caring about the lack of POSIX technicality.

> I think write vs. write consistency is the more interesting case; the
> question there is does falling back to the inode lock when we can't lock
> all the folios simultaneously work.

I really don't think the write-write consistency is all that
interesting either, and it really does hurt.

If you're some toy database that would love to use buffered writes on
just a DB file, that "no concurrent writes" can hurt a lot. So then
people say "use DIO", but that has its own issues...

There is one obvious special case, and I think it's the primary one
why we end up having that inode_lock: O_APPEND or any other write
extending the size of the file.

THAT one obviously has to work right, and that's the case when
multiple writers actually do want to get write-write consistency, and
where it makes total sense to serialize them all.

That's the one case that even DIO cares about.

In the other cases, it's hard to argue that "one or the other wins the
whole range" is seriously hugely better than "one or the other wins at
some granularity". What the hell are you doing overlapping write
ranges for if you have a "one or the other" mentality?

Of course, maybe in practice it would be fine to do the "lock all the
folios, with the fallback being the inode lock" - and we could even
start with "all" being a pretty small number (perhaps starting with
"one" ;^).

                Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26  1:32                 ` Linus Torvalds
  2024-02-26  1:58                   ` Kent Overstreet
@ 2024-02-26  2:50                   ` Al Viro
  2024-02-26 17:17                     ` Linus Torvalds
  1 sibling, 1 reply; 87+ messages in thread
From: Al Viro @ 2024-02-26  2:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kent Overstreet, Matthew Wilcox, Luis Chamberlain, lsf-pc,
	linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe,
	Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner

On Sun, Feb 25, 2024 at 05:32:14PM -0800, Linus Torvalds wrote:

> But to take a different example, ext2 just calls
> generic_file_write_iter() *without* taking the inode lock, and does
> locking one page at a time. As far as I know, nobody ever really
> complained.

Uh?  __generic_file_write_iter() doesn't take inode lock, but
generic_file_write_iter() does.  O_APPEND handling aside, there's
also file_remove_privs() in there and it does need that lock.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-23 23:59 [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO Luis Chamberlain
  2024-02-24  4:12 ` Matthew Wilcox
  2024-02-25  5:24 ` Kent Overstreet
@ 2024-02-26 12:22 ` Dave Chinner
  2024-02-27 10:07 ` Kent Overstreet
  3 siblings, 0 replies; 87+ messages in thread
From: Dave Chinner @ 2024-02-26 12:22 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Christoph Hellwig, Chris Mason, Johannes Weiner,
	Matthew Wilcox, Linus Torvalds

On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> I recently ran a different type of simple test, focused on sequantial writes
> to fill capacity, with write workload essentially matching your RAM, so
> having parity with your RAM. Technically in the case of max size that I
> tested the writes were just *slightly* over the RAM, that's a minor
> technicality given I did other tests with similar sizes which showed similar
> results... This test should be possible to reproduce then if you have more
> than enough RAM to spare. In this case the system uses 1 TiB RAM, using
> pmem to avoid drive variance / GC / and other drive shenanigans.
> 
> So pmem grub setup:
> 
> memmap=500G!4G memmap=3G!504G
> 
> As noted earlier, surely, DIO / DAX is best for pmem (and I actually get
> a difference between using just DIO and DAX, but that digresses), but
> when one is wishing to test buffered IO on purpose it makes sense to do
> this. Yes, we can test tmpfs too... but I believe that topic will be
> brought up at LSFMM separately.  The delta with DIO and buffered IO on
> XFS is astronomical:
> 
>   ~86 GiB/s on pmem DIO on xfs with 64k block size, 1024 XFS agcount on x86_64
>      Vs
>  ~ 7,000 MiB/s with buffered IO

You're not testing apples to apples.

Buffered writes to the same superblock serialise on IO submission,
not write() calls, so it doesn't matter how much concurrency you
have in write() syscalls. That is, streaming buffered write
throughput is entirely limited by the number of IOs that 
the bdi flusher thread can submit.

For ext4, XFS and btrfs, delayed allocation means that this
writeback thread is also doing extent allocation for all IO, and
hence the single writeback thread for buffered writes is the
performance limiting factor for them.

It doesn't matter how fast you can copy in to the kernel, it can
only drain as fast as it can submit IO. As soon as this writeback
thread is CPU bound, incoming buffered write()s will be throttled
back to the rate at which memory can be cleaned by the writeback
thread.

Direct IO doesn't have this limitation - it's an orange in
comparison because IO is always submitted by the task that does the
write() syscall. Hence it inherently scales out to the limit of the
underlying hardware and it is not limited by the throughput of a
single CPU like page cache writeback is.

If you wonder why people are saying "issue sync_file_range()
periodically" to improve buffered write throughput, it's because it
moves the async writeback submission for that inode out of the
single background writeback thread and into task context where IO
submission can be trivially parallelised. Just like direct IO....
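
For reference, the manual version from userspace is roughly this
(chunk size arbitrary, error handling trimmed):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define CHUNK	(64UL << 20)	/* arbitrary */

/* Stream data out, kicking async writeback from the writing task
 * every CHUNK bytes instead of leaving all submission to the single
 * bdi flusher thread. */
static void stream_out(int fd, const char *buf, size_t bufsz, size_t total)
{
	off_t done = 0, kicked = 0;

	while ((size_t)done < total) {
		ssize_t n = write(fd, buf, bufsz);

		if (n <= 0)
			break;
		done += n;
		if (done - kicked >= (off_t)CHUNK) {
			/* start (async) writeback for the completed chunk */
			sync_file_range(fd, kicked, done - kicked,
					SYNC_FILE_RANGE_WRITE);
			kicked = done;
		}
	}
}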

IOWs, the issue you are demonstrating is the inherent limitations in
single threaded write-behind cache flushing, and the solution to
that specific bottleneck is to enable concurrent writeback
submission from the same file and/or superblock via various
available manual mechanisms.

An automatic way of doing this for large streaming writes is to switch
from write-behind to near-write-through, such that the majority of
write IO is submitted asynchronously from the write() syscall. Think
of how readahead from read() context pulls in data that is likely to
be needed soon - sequential writes should trigger similar behaviour
where we do async write-behind of the previous write()s in the
context of the current write. Track a sequential write window like
we do readahead, and trigger async writeback for such streaming
writes from the write() context...
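
Purely as a sketch - only __filemap_fdatawrite_range() is a real call
here, the rest is invented:

/* Hypothetical per-file streaming-write state. */
struct write_behind {
	loff_t window_start;
	loff_t next_expected;	/* end of the previous sequential write */
};

static void maybe_write_behind(struct address_space *mapping,
			       struct write_behind *wb,
			       loff_t pos, size_t count)
{
	if (pos != wb->next_expected) {
		/* not sequential: restart the window */
		wb->window_start = pos;
	} else if (pos - wb->window_start >= SZ_16M) {
		/* async (WB_SYNC_NONE) writeback of the window so far,
		 * submitted from write() context, not the flusher */
		__filemap_fdatawrite_range(mapping, wb->window_start,
					   pos - 1, WB_SYNC_NONE);
		wb->window_start = pos;
	}
	wb->next_expected = pos + count;
}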

That doesn't solve the huge tarball problem where we create millions
of small files in a couple of seconds, then have to wait for
single threaded writeback to drain them to the storage at 50,000
files/s. We can create files and get the data into the cache far
faster and with way more concurrency than the page cache can push
the data back to the storage itself.

IOWs, the problems with page cache write throughput really have
nothing to do with write() scalability, folios or filesystem block
sizes. The fundamental problem is single-threaded writeback IO
submission, and the throttling of incoming writes to whatever speed
it runs at when CPU bound....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26  2:50                   ` Al Viro
@ 2024-02-26 17:17                     ` Linus Torvalds
  2024-02-26 21:07                       ` Matthew Wilcox
  2024-02-26 22:46                       ` Linus Torvalds
  0 siblings, 2 replies; 87+ messages in thread
From: Linus Torvalds @ 2024-02-26 17:17 UTC (permalink / raw)
  To: Al Viro
  Cc: Kent Overstreet, Matthew Wilcox, Luis Chamberlain, lsf-pc,
	linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe,
	Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner

On Sun, 25 Feb 2024 at 18:49, Al Viro <viro@kernel.org> wrote:
>
> Uh?  __generic_file_write_iter() doesn't take inode lock, but
> generic_file_write_iter() does.

Yes, as I replied to Kent later, I mis-remembered the details
(together with a too-quick grep). It's the reading side that just
doesn't care, and which causes us to not actually give the POSIX
guarantees anyway.

>  O_APPEND handling aside, there's
> also file_remove_privs() in there and it does need that lock.

Fair enough, but that's such a rare special case that it really
doesn't merit the lock on every write.

What at least the ext4 the DIO code does is something like "look at
the write position and the inode size, and decide optimistically
whether to take the lock shared or not". Then it re-checks it after
potentially taking it for a read (in case the inode size has changed),
and might decide to go for a write lock after all..

And I think we could fairly trivially do something similar on the
regular write side. Yes, it needs to check SUID/SGID bits in addition
to the inode size change, I guess (I don't think the ext4 dio code
does, but my quick grepping might once again be incomplete).
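
Something along these lines, loosely modelled on the ext4 DIO pattern
(the helper name is made up, and the SUID/SGID handling is only
sketched as a comment):

static void buffered_write_lock(struct kiocb *iocb, struct iov_iter *from,
				bool *excl)
{
	struct inode *inode = file_inode(iocb->ki_filp);

	/* optimistic guess: non-extending writes can share the lock */
	*excl = (iocb->ki_flags & IOCB_APPEND) ||
		iocb->ki_pos + iov_iter_count(from) > i_size_read(inode);
	/* ... a "SUID/SGID needs stripping" check would also force excl */

	if (*excl) {
		inode_lock(inode);
		return;
	}

	inode_lock_shared(inode);
	/* i_size may have changed before we got the lock; extending
	 * writes still need the exclusive lock, so upgrade */
	if (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode)) {
		inode_unlock_shared(inode);
		inode_lock(inode);
		*excl = true;
	}
}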

Anyway, DaveC is obviously also right that for the "actually need to
do writeback" case, our writeback is currently intentionally
throttled, which is why the benchmark by Luis shows that "almost two
orders of magnitude" slowdown with buffered writeback. That's probably
mainly an effect of having a backing store with no delays, but still..

However, the reason I dislike our write-side locking is that it
actually serializes even the entirely cached case.

Now, writing concurrently to the same inode is obviously strange, but
it's a common thing for databases. And while all the *serious* ones
use DIO, I think the buffered case really should do better.

Willy - tangential side note: I looked closer at the issue that you
reported (indirectly) with the small reads during heavy write
activity.

Our _reading_ side is very optimized and has none of the write-side
oddities that I can see, and we just have

  filemap_read ->
    filemap_get_pages ->
        filemap_get_read_batch ->
          folio_try_get_rcu()

and there is no page locking or other locking involved (assuming the
page is cached and marked uptodate etc, of course).

So afaik, it really is just that *one* atomic access (and the matching
page ref decrement afterwards).

We could easily do all of this without getting any ref to the page at
all if we did the page cache release with RCU (and the user copy with
"copy_to_user_atomic()").  Honestly, anything else looks like a
complete disaster. For tiny reads, a temporary buffer sounds ok, but
really *only* for tiny reads where we could have that buffer on the
stack.

Are tiny reads (handwaving: 100 bytes or less) really worth optimizing
for to that degree?

In contrast, the RCU-delaying of the page cache might be a good idea
in general. We've had other situations where that would have been
nice. The main worry would be low-memory situations, I suspect.

The "tiny read" optimization smells like a benchmark thing to me. Even
with the cacheline possibly bouncing, the system call overhead for
tiny reads (particularly with all the mitigations) should be orders of
magnitude higher than two atomic accesses.

           Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26 17:17                     ` Linus Torvalds
@ 2024-02-26 21:07                       ` Matthew Wilcox
  2024-02-26 21:17                         ` Kent Overstreet
  2024-02-26 22:46                       ` Linus Torvalds
  1 sibling, 1 reply; 87+ messages in thread
From: Matthew Wilcox @ 2024-02-26 21:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Kent Overstreet, Luis Chamberlain, lsf-pc,
	linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe,
	Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner

On Mon, Feb 26, 2024 at 09:17:33AM -0800, Linus Torvalds wrote:
> Willy - tangential side note: I looked closer at the issue that you
> reported (indirectly) with the small reads during heavy write
> activity.
> 
> Our _reading_ side is very optimized and has none of the write-side
> oddities that I can see, and we just have
> 
>   filemap_read ->
>     filemap_get_pages ->
>         filemap_get_read_batch ->
>           folio_try_get_rcu()
> 
> and there is no page locking or other locking involved (assuming the
> page is cached and marked uptodate etc, of course).
> 
> So afaik, it really is just that *one* atomic access (and the matching
> page ref decrement afterwards).

Yep, that was what the customer reported on their ancient kernel, and
we at least didn't make that worse ...

> We could easily do all of this without getting any ref to the page at
> all if we did the page cache release with RCU (and the user copy with
> "copy_to_user_atomic()").  Honestly, anything else looks like a
> complete disaster. For tiny reads, a temporary buffer sounds ok, but
> really *only* for tiny reads where we could have that buffer on the
> stack.
> 
> Are tiny reads (handwaving: 100 bytes or less) really worth optimizing
> for to that degree?
> 
> In contrast, the RCU-delaying of the page cache might be a good idea
> in general. We've had other situations where that would have been
> nice. The main worry would be low-memory situations, I suspect.
> 
> The "tiny read" optimization smells like a benchmark thing to me. Even
> with the cacheline possibly bouncing, the system call overhead for
> tiny reads (particularly with all the mitigations) should be orders of
> magnitude higher than two atomic accesses.

Ah, good point about the $%^&^*^ mitigations.  This was pre mitigations.
I suspect that this customer would simply disable them; afaik the machine
is an appliance and one interacts with it purely by sending transactions
to it (it's not even an SQL system, much less a "run arbitrary javascript"
kind of system).  But that makes it even more special case, inapplicable
to the majority of workloads and closer to smelling like a benchmark.

I've thought about and rejected RCU delaying of the page cache in the
past.  With the majority of memory in anon memory & file memory, it just
feels too risky to have so much memory waiting to be reused.  We could
also improve gup-fast if we could rely on RCU freeing of anon memory.
Not sure what workloads might benefit from that, though.

It'd be cute if we could restrict free memory to be only reallocatable by
the process that had previously allocated it until an RCU grace period
had passed.  That way you could still snoop, but only on yourself which
wouldn't be all that exciting.  Doubt it'd be worth the effort of setting
up per-mm freelists, although it would allow us to skip zeroing the page
under some circumstances ...


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26 21:07                       ` Matthew Wilcox
@ 2024-02-26 21:17                         ` Kent Overstreet
  2024-02-26 21:19                           ` Kent Overstreet
  0 siblings, 1 reply; 87+ messages in thread
From: Kent Overstreet @ 2024-02-26 21:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Al Viro, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Mon, Feb 26, 2024 at 09:07:51PM +0000, Matthew Wilcox wrote:
> On Mon, Feb 26, 2024 at 09:17:33AM -0800, Linus Torvalds wrote:
> > Willy - tangential side note: I looked closer at the issue that you
> > reported (indirectly) with the small reads during heavy write
> > activity.
> > 
> > Our _reading_ side is very optimized and has none of the write-side
> > oddities that I can see, and we just have
> > 
> >   filemap_read ->
> >     filemap_get_pages ->
> >         filemap_get_read_batch ->
> >           folio_try_get_rcu()
> > 
> > and there is no page locking or other locking involved (assuming the
> > page is cached and marked uptodate etc, of course).
> > 
> > So afaik, it really is just that *one* atomic access (and the matching
> > page ref decrement afterwards).
> 
> Yep, that was what the customer reported on their ancient kernel, and
> we at least didn't make that worse ...
> 
> > We could easily do all of this without getting any ref to the page at
> > all if we did the page cache release with RCU (and the user copy with
> > "copy_to_user_atomic()").  Honestly, anything else looks like a
> > complete disaster. For tiny reads, a temporary buffer sounds ok, but
> > really *only* for tiny reads where we could have that buffer on the
> > stack.
> > 
> > Are tiny reads (handwaving: 100 bytes or less) really worth optimizing
> > for to that degree?
> > 
> > In contrast, the RCU-delaying of the page cache might be a good idea
> > in general. We've had other situations where that would have been
> > nice. The main worry would be low-memory situations, I suspect.
> > 
> > The "tiny read" optimization smells like a benchmark thing to me. Even
> > with the cacheline possibly bouncing, the system call overhead for
> > tiny reads (particularly with all the mitigations) should be orders of
> > magnitude higher than two atomic accesses.
> 
> Ah, good point about the $%^&^*^ mitigations.  This was pre mitigations.
> I suspect that this customer would simply disable them; afaik the machine
> is an appliance and one interacts with it purely by sending transactions
> to it (it's not even an SQL system, much less a "run arbitrary javascript"
> kind of system).  But that makes it even more special case, inapplicable
> to the majority of workloads and closer to smelling like a benchmark.
> 
> I've thought about and rejected RCU delaying of the page cache in the
> past.  With the majority of memory in anon memory & file memory, it just
> feels too risky to have so much memory waiting to be reused.  We could
> also improve gup-fast if we could rely on RCU freeing of anon memory.
> Not sure what workloads might benefit from that, though.

RCU allocating and freeing of memory can already be fairly significant
depending on workload, and I'd expect that to grow - we really just need
a way for reclaim to kick RCU when needed (and probably add a percpu
counter for "amount of memory stranded until the next RCU grace
period").
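
A minimal sketch of the accounting I mean (nothing like this exists
today; names invented, init/teardown omitted):

static struct percpu_counter rcu_stranded_bytes;

struct rcu_freed_mem {
	struct rcu_head	rcu;
	size_t		size;
};

static void rcu_stranded_free(struct rcu_head *head)
{
	struct rcu_freed_mem *m =
		container_of(head, struct rcu_freed_mem, rcu);

	percpu_counter_sub(&rcu_stranded_bytes, m->size);
	kfree(m);
}

static void free_mem_rcu(struct rcu_freed_mem *m, size_t size)
{
	m->size = size;
	percpu_counter_add(&rcu_stranded_bytes, size);
	call_rcu(&m->rcu, rcu_stranded_free);
}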

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26 21:17                         ` Kent Overstreet
@ 2024-02-26 21:19                           ` Kent Overstreet
  2024-02-26 21:55                             ` Paul E. McKenney
  0 siblings, 1 reply; 87+ messages in thread
From: Kent Overstreet @ 2024-02-26 21:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Al Viro, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner,
	Paul E. McKenney

+cc Paul

On Mon, Feb 26, 2024 at 04:17:19PM -0500, Kent Overstreet wrote:
> On Mon, Feb 26, 2024 at 09:07:51PM +0000, Matthew Wilcox wrote:
> > On Mon, Feb 26, 2024 at 09:17:33AM -0800, Linus Torvalds wrote:
> > > Willy - tangential side note: I looked closer at the issue that you
> > > reported (indirectly) with the small reads during heavy write
> > > activity.
> > > 
> > > Our _reading_ side is very optimized and has none of the write-side
> > > oddities that I can see, and we just have
> > > 
> > >   filemap_read ->
> > >     filemap_get_pages ->
> > >         filemap_get_read_batch ->
> > >           folio_try_get_rcu()
> > > 
> > > and there is no page locking or other locking involved (assuming the
> > > page is cached and marked uptodate etc, of course).
> > > 
> > > So afaik, it really is just that *one* atomic access (and the matching
> > > page ref decrement afterwards).
> > 
> > Yep, that was what the customer reported on their ancient kernel, and
> > we at least didn't make that worse ...
> > 
> > > We could easily do all of this without getting any ref to the page at
> > > all if we did the page cache release with RCU (and the user copy with
> > > "copy_to_user_atomic()").  Honestly, anything else looks like a
> > > complete disaster. For tiny reads, a temporary buffer sounds ok, but
> > > really *only* for tiny reads where we could have that buffer on the
> > > stack.
> > > 
> > > Are tiny reads (handwaving: 100 bytes or less) really worth optimizing
> > > for to that degree?
> > > 
> > > In contrast, the RCU-delaying of the page cache might be a good idea
> > > in general. We've had other situations where that would have been
> > > nice. The main worry would be low-memory situations, I suspect.
> > > 
> > > The "tiny read" optimization smells like a benchmark thing to me. Even
> > > with the cacheline possibly bouncing, the system call overhead for
> > > tiny reads (particularly with all the mitigations) should be orders of
> > > magnitude higher than two atomic accesses.
> > 
> > Ah, good point about the $%^&^*^ mitigations.  This was pre mitigations.
> > I suspect that this customer would simply disable them; afaik the machine
> > is an appliance and one interacts with it purely by sending transactions
> > to it (it's not even an SQL system, much less a "run arbitrary javascript"
> > kind of system).  But that makes it even more special case, inapplicable
> > to the majority of workloads and closer to smelling like a benchmark.
> > 
> > I've thought about and rejected RCU delaying of the page cache in the
> > past.  With the majority of memory in anon memory & file memory, it just
> > feels too risky to have so much memory waiting to be reused.  We could
> > also improve gup-fast if we could rely on RCU freeing of anon memory.
> > Not sure what workloads might benefit from that, though.
> 
> RCU allocating and freeing of memory can already be fairly significant
> depending on workload, and I'd expect that to grow - we really just need
> a way for reclaim to kick RCU when needed (and probably add a percpu
> counter for "amount of memory stranded until the next RCU grace
> period").

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26 21:19                           ` Kent Overstreet
@ 2024-02-26 21:55                             ` Paul E. McKenney
  2024-02-26 23:29                               ` Kent Overstreet
  0 siblings, 1 reply; 87+ messages in thread
From: Paul E. McKenney @ 2024-02-26 21:55 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Mon, Feb 26, 2024 at 04:19:14PM -0500, Kent Overstreet wrote:
> +cc Paul
> 
> On Mon, Feb 26, 2024 at 04:17:19PM -0500, Kent Overstreet wrote:
> > On Mon, Feb 26, 2024 at 09:07:51PM +0000, Matthew Wilcox wrote:
> > > On Mon, Feb 26, 2024 at 09:17:33AM -0800, Linus Torvalds wrote:
> > > > Willy - tangential side note: I looked closer at the issue that you
> > > > reported (indirectly) with the small reads during heavy write
> > > > activity.
> > > > 
> > > > Our _reading_ side is very optimized and has none of the write-side
> > > > oddities that I can see, and we just have
> > > > 
> > > >   filemap_read ->
> > > >     filemap_get_pages ->
> > > >         filemap_get_read_batch ->
> > > >           folio_try_get_rcu()
> > > > 
> > > > and there is no page locking or other locking involved (assuming the
> > > > page is cached and marked uptodate etc, of course).
> > > > 
> > > > So afaik, it really is just that *one* atomic access (and the matching
> > > > page ref decrement afterwards).
> > > 
> > > Yep, that was what the customer reported on their ancient kernel, and
> > > we at least didn't make that worse ...
> > > 
> > > > We could easily do all of this without getting any ref to the page at
> > > > all if we did the page cache release with RCU (and the user copy with
> > > > "copy_to_user_atomic()").  Honestly, anything else looks like a
> > > > complete disaster. For tiny reads, a temporary buffer sounds ok, but
> > > > really *only* for tiny reads where we could have that buffer on the
> > > > stack.
> > > > 
> > > > Are tiny reads (handwaving: 100 bytes or less) really worth optimizing
> > > > for to that degree?
> > > > 
> > > > In contrast, the RCU-delaying of the page cache might be a good idea
> > > > in general. We've had other situations where that would have been
> > > > nice. The main worry would be low-memory situations, I suspect.
> > > > 
> > > > The "tiny read" optimization smells like a benchmark thing to me. Even
> > > > with the cacheline possibly bouncing, the system call overhead for
> > > > tiny reads (particularly with all the mitigations) should be orders of
> > > > magnitude higher than two atomic accesses.
> > > 
> > > Ah, good point about the $%^&^*^ mitigations.  This was pre mitigations.
> > > I suspect that this customer would simply disable them; afaik the machine
> > > is an appliance and one interacts with it purely by sending transactions
> > > to it (it's not even an SQL system, much less a "run arbitrary javascript"
> > > kind of system).  But that makes it even more special case, inapplicable
> > > to the majority of workloads and closer to smelling like a benchmark.
> > > 
> > > I've thought about and rejected RCU delaying of the page cache in the
> > > past.  With the majority of memory in anon memory & file memory, it just
> > > feels too risky to have so much memory waiting to be reused.  We could
> > > also improve gup-fast if we could rely on RCU freeing of anon memory.
> > > Not sure what workloads might benefit from that, though.
> > 
> > RCU allocating and freeing of memory can already be fairly significant
> > depending on workload, and I'd expect that to grow - we really just need
> > a way for reclaim to kick RCU when needed (and probably add a percpu
> > counter for "amount of memory stranded until the next RCU grace
> > period").

There are some APIs for that, though they are sharp-edged and mainly
intended for rcutorture, and there are some hooks for a CI Kconfig
option called RCU_STRICT_GRACE_PERIOD that could be organized into
something useful.

Of course, if there is a long-running RCU reader, there is nothing
RCU can do.  By definition, it must wait on all pre-existing readers,
no exceptions.

But my guess is that you instead are thinking of memory-exhaustion
emergencies where you would like RCU to burn more CPU than usual to
reduce grace-period latency; there are definitely things that can be done.
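
For a concrete flavor of "things that can be done", a deliberately
rough sketch using knobs that exist today (though they are normally
rcutorture territory, and the policy here is invented):

/* Sketch: let reclaim trade CPU for grace-period latency when a lot
 * of memory is stranded behind RCU.  Threshold is arbitrary, and the
 * synchronize call sleeps, so this can't run from atomic reclaim. */
static void reclaim_kick_rcu(u64 stranded_bytes)
{
	if (stranded_bytes < SZ_128M)
		return;

	rcu_expedite_gp();		/* make subsequent GPs expedited */
	synchronize_rcu_expedited();	/* and push one through right now */
	rcu_unexpedite_gp();
}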

I am sure that there are more questions that I should ask, but the one
that comes immediately to mind is "Is this API call an occasional thing,
or does RCU need to tolerate many CPUs hammering it frequently?"
Either answer is fine, I just need to know.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26 17:17                     ` Linus Torvalds
  2024-02-26 21:07                       ` Matthew Wilcox
@ 2024-02-26 22:46                       ` Linus Torvalds
  2024-02-26 23:48                         ` Linus Torvalds
  2024-05-14 11:52                         ` Luis Chamberlain
  1 sibling, 2 replies; 87+ messages in thread
From: Linus Torvalds @ 2024-02-26 22:46 UTC (permalink / raw)
  To: Al Viro
  Cc: Kent Overstreet, Matthew Wilcox, Luis Chamberlain, lsf-pc,
	linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe,
	Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner

[-- Attachment #1: Type: text/plain, Size: 1147 bytes --]

On Mon, 26 Feb 2024 at 09:17, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Are tiny reads (handwaving: 100 bytes or less) really worth optimizing
> for to that degree?

So this ended up just bothering me, and dammit, we already *have* that
stack buffer, except it's that 'folio batch' buffer.

Which is already 128 bytes of stack space (on 64-bit).

So, just hypothetically, let's say that *before* we start using that
folio batch buffer for folios, we use it as a dummy buffer for small
reads.

So we'd make that 'fbatch' thing be a union with a temporary byte buffer.

That hypothetical patch might look something like this TOTALLY UNTESTED CRAP.

Anybody interested in seeing if something like this might actually
work? I do want to emphasize the "something like this".

This pile of random thoughts ends up compiling for me, and I _tried_
to think of all the cases, but there might be obvious thinkos, and
there might be things I just didn't think about at all.

I really haven't tested this AT ALL. I'm much too scared. But I don't
actually hate how the code looks nearly as much as I *thought* I'd
hate it.

              Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 5033 bytes --]

 mm/filemap.c | 117 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 107 insertions(+), 10 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 750e779c23db..5f66fd1bc82d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2543,6 +2543,85 @@ static inline bool pos_same_folio(loff_t pos1, loff_t pos2, struct folio *folio)
 	return (pos1 >> shift == pos2 >> shift);
 }
 
+/*
+ * I can't be bothered to care about HIGHMEM for the fast read case
+ */
+#ifdef CONFIG_HIGHMEM
+#define filemap_fast_read(mapping, pos, buffer, size) 0
+#else
+
+/*
+ * Called under RCU with size limited to the file size and one
+ */
+static unsigned long filemap_folio_copy_rcu(struct address_space *mapping, loff_t pos, char *buffer, size_t size)
+{
+	XA_STATE(xas, &mapping->i_pages, pos >> PAGE_SHIFT);
+	struct folio *folio;
+	size_t offset;
+
+	xas_reset(&xas);
+	folio = xas_load(&xas);
+	if (xas_retry(&xas, folio))
+		return 0;
+
+	if (!folio || xa_is_value(folio))
+		return 0;
+
+	if (!folio_test_uptodate(folio))
+		return 0;
+
+	/* No fast-case if we are supposed to start readahead */
+	if (folio_test_readahead(folio))
+		return 0;
+	/* .. or mark it accessed */
+	if (!folio_test_referenced(folio))
+		return 0;
+
+	/* Do the data copy */
+	offset = pos & (folio_size(folio) - 1);
+	memcpy(buffer, folio_address(folio) + offset, size);
+
+	/* We should probably do some silly memory barrier here */
+	if (unlikely(folio != xas_reload(&xas)))
+		return 0;
+
+	return size;
+}
+
+/*
+ * Iff we can complete the read completely in one atomic go under RCU,
+ * do so here. Otherwise return 0 (no partial reads, please - this is
+ * purely for the trivial fast case).
+ */
+static unsigned long filemap_fast_read(struct address_space *mapping, loff_t pos, char *buffer, size_t size)
+{
+	struct inode *inode;
+	loff_t file_size;
+	unsigned long pgoff;
+
+	/* Don't even try for page-crossers */
+	pgoff = pos & ~PAGE_MASK;
+	if (pgoff + size > PAGE_SIZE)
+		return 0;
+
+	/* Limit it to the file size */
+	inode = mapping->host;
+	file_size = i_size_read(inode);
+	if (unlikely(pos >= file_size))
+		return 0;
+	file_size -= pos;
+	if (file_size < size)
+		size = file_size;
+
+	/* Let's see if we can just do the read under RCU */
+	rcu_read_lock();
+	size = filemap_folio_copy_rcu(mapping, pos, buffer, size);
+	rcu_read_unlock();
+
+	return size;
+}
+#endif /* !HIGHMEM */
+
 /**
  * filemap_read - Read data from the page cache.
  * @iocb: The iocb to read.
@@ -2563,7 +2642,10 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 	struct file_ra_state *ra = &filp->f_ra;
 	struct address_space *mapping = filp->f_mapping;
 	struct inode *inode = mapping->host;
-	struct folio_batch fbatch;
+	union {
+		struct folio_batch fbatch;
+		__DECLARE_FLEX_ARRAY(char, buffer);
+	} area;
 	int i, error = 0;
 	bool writably_mapped;
 	loff_t isize, end_offset;
@@ -2575,7 +2657,22 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 		return 0;
 
 	iov_iter_truncate(iter, inode->i_sb->s_maxbytes);
-	folio_batch_init(&fbatch);
+
+	if (iov_iter_count(iter) < sizeof(area)) {
+		unsigned long count = iov_iter_count(iter);
+
+		count = filemap_fast_read(mapping, iocb->ki_pos, area.buffer, count);
+		if (count) {
+			size_t copied = copy_to_iter(area.buffer, count, iter);
+			if (unlikely(!copied))
+				return already_read ? already_read : -EFAULT;
+			ra->prev_pos = iocb->ki_pos += copied;
+			file_accessed(filp);
+			return copied + already_read;
+		}
+	}
+
+	folio_batch_init(&area.fbatch);
 
 	do {
 		cond_resched();
@@ -2591,7 +2688,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 		if (unlikely(iocb->ki_pos >= i_size_read(inode)))
 			break;
 
-		error = filemap_get_pages(iocb, iter->count, &fbatch, false);
+		error = filemap_get_pages(iocb, iter->count, &area.fbatch, false);
 		if (error < 0)
 			break;
 
@@ -2628,11 +2725,11 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 		 * mark it as accessed the first time.
 		 */
 		if (!pos_same_folio(iocb->ki_pos, last_pos - 1,
-				    fbatch.folios[0]))
-			folio_mark_accessed(fbatch.folios[0]);
+				    area.fbatch.folios[0]))
+			folio_mark_accessed(area.fbatch.folios[0]);
 
-		for (i = 0; i < folio_batch_count(&fbatch); i++) {
-			struct folio *folio = fbatch.folios[i];
+		for (i = 0; i < folio_batch_count(&area.fbatch); i++) {
+			struct folio *folio = area.fbatch.folios[i];
 			size_t fsize = folio_size(folio);
 			size_t offset = iocb->ki_pos & (fsize - 1);
 			size_t bytes = min_t(loff_t, end_offset - iocb->ki_pos,
@@ -2663,9 +2760,9 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 			}
 		}
 put_folios:
-		for (i = 0; i < folio_batch_count(&fbatch); i++)
-			folio_put(fbatch.folios[i]);
-		folio_batch_init(&fbatch);
+		for (i = 0; i < folio_batch_count(&area.fbatch); i++)
+			folio_put(area.fbatch.folios[i]);
+		folio_batch_init(&area.fbatch);
 	} while (iov_iter_count(iter) && iocb->ki_pos < isize && !error);
 
 	file_accessed(filp);

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26 21:55                             ` Paul E. McKenney
@ 2024-02-26 23:29                               ` Kent Overstreet
  2024-02-27  0:05                                 ` Paul E. McKenney
  2024-02-27  0:43                                 ` Dave Chinner
  0 siblings, 2 replies; 87+ messages in thread
From: Kent Overstreet @ 2024-02-26 23:29 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Mon, Feb 26, 2024 at 01:55:10PM -0800, Paul E. McKenney wrote:
> On Mon, Feb 26, 2024 at 04:19:14PM -0500, Kent Overstreet wrote:
> > +cc Paul
> > 
> > On Mon, Feb 26, 2024 at 04:17:19PM -0500, Kent Overstreet wrote:
> > > On Mon, Feb 26, 2024 at 09:07:51PM +0000, Matthew Wilcox wrote:
> > > > On Mon, Feb 26, 2024 at 09:17:33AM -0800, Linus Torvalds wrote:
> > > > > Willy - tangential side note: I looked closer at the issue that you
> > > > > reported (indirectly) with the small reads during heavy write
> > > > > activity.
> > > > > 
> > > > > Our _reading_ side is very optimized and has none of the write-side
> > > > > oddities that I can see, and we just have
> > > > > 
> > > > >   filemap_read ->
> > > > >     filemap_get_pages ->
> > > > >         filemap_get_read_batch ->
> > > > >           folio_try_get_rcu()
> > > > > 
> > > > > and there is no page locking or other locking involved (assuming the
> > > > > page is cached and marked uptodate etc, of course).
> > > > > 
> > > > > So afaik, it really is just that *one* atomic access (and the matching
> > > > > page ref decrement afterwards).
> > > > 
> > > > Yep, that was what the customer reported on their ancient kernel, and
> > > > we at least didn't make that worse ...
> > > > 
> > > > > We could easily do all of this without getting any ref to the page at
> > > > > all if we did the page cache release with RCU (and the user copy with
> > > > > "copy_to_user_atomic()").  Honestly, anything else looks like a
> > > > > complete disaster. For tiny reads, a temporary buffer sounds ok, but
> > > > > really *only* for tiny reads where we could have that buffer on the
> > > > > stack.
> > > > > 
> > > > > Are tiny reads (handwaving: 100 bytes or less) really worth optimizing
> > > > > for to that degree?
> > > > > 
> > > > > In contrast, the RCU-delaying of the page cache might be a good idea
> > > > > in general. We've had other situations where that would have been
> > > > > nice. The main worry would be low-memory situations, I suspect.
> > > > > 
> > > > > The "tiny read" optimization smells like a benchmark thing to me. Even
> > > > > with the cacheline possibly bouncing, the system call overhead for
> > > > > tiny reads (particularly with all the mitigations) should be orders of
> > > > > magnitude higher than two atomic accesses.
> > > > 
> > > > Ah, good point about the $%^&^*^ mitigations.  This was pre mitigations.
> > > > I suspect that this customer would simply disable them; afaik the machine
> > > > is an appliance and one interacts with it purely by sending transactions
> > > > to it (it's not even an SQL system, much less a "run arbitrary javascript"
> > > > kind of system).  But that makes it even more special case, inapplicable
> > > > to the majority of workloads and closer to smelling like a benchmark.
> > > > 
> > > > I've thought about and rejected RCU delaying of the page cache in the
> > > > past.  With the majority of memory in anon memory & file memory, it just
> > > > feels too risky to have so much memory waiting to be reused.  We could
> > > > also improve gup-fast if we could rely on RCU freeing of anon memory.
> > > > Not sure what workloads might benefit from that, though.
> > > 
> > > RCU allocating and freeing of memory can already be fairly significant
> > > depending on workload, and I'd expect that to grow - we really just need
> > > a way for reclaim to kick RCU when needed (and probably add a percpu
> > > counter for "amount of memory stranded until the next RCU grace
> > > period").
> 
> There are some APIs for that, though the are sharp-edged and mainly
> intended for rcutorture, and there are some hooks for a CI Kconfig
> option called RCU_STRICT_GRACE_PERIOD that could be organized into
> something useful.
> 
> Of course, if there is a long-running RCU reader, there is nothing
> RCU can do.  By definition, it must wait on all pre-existing readers,
> no exceptions.
> 
> But my guess is that you instead are thinking of memory-exhaustion
> emergencies where you would like RCU to burn more CPU than usual to
> reduce grace-period latency, there are definitely things that can be done.
> 
> I am sure that there are more questions that I should ask, but the one
> that comes immediately to mind is "Is this API call an occasional thing,
> or does RCU need to tolerate many CPUs hammering it frequently?"
> Either answer is fine, I just need to know.  ;-)

Well, we won't want it getting hammered on continuously - we should be
able to tune reclaim so that doesn't happen.

I think getting numbers on the amount of memory stranded waiting for RCU
is probably the first order of business - a minor tweak to kfree_rcu() et al.
for that; there are APIs that can be queried to maintain that counter.

Then we can add a heuristic threshold somewhere, something like

if (rcu_stranded * multiplier > reclaimable_memory)
	kick_rcu()
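
(To make that concrete, a minimal sketch only: the rcu_stranded_bytes
counter and the multiplier are hypothetical, while rcu_expedite_gp(),
rcu_unexpedite_gp() and synchronize_rcu() are the existing, admittedly
sharp-edged, APIs mentioned above:)

static atomic_long_t rcu_stranded_bytes;  /* hypothetical: bytes awaiting a GP */

static bool reclaim_should_kick_rcu(unsigned long reclaimable_memory)
{
	/* the multiplier (here 2) is a made-up tunable */
	return atomic_long_read(&rcu_stranded_bytes) * 2 > reclaimable_memory;
}

static void reclaim_maybe_kick_rcu(unsigned long reclaimable_memory)
{
	if (reclaim_should_kick_rcu(reclaimable_memory)) {
		rcu_expedite_gp();	/* shorten grace periods at some CPU cost */
		synchronize_rcu();	/* wait for the stranded memory to be freed */
		rcu_unexpedite_gp();
	}
}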

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26 22:46                       ` Linus Torvalds
@ 2024-02-26 23:48                         ` Linus Torvalds
  2024-02-27  7:21                           ` Kent Overstreet
  2024-05-14 11:52                         ` Luis Chamberlain
  1 sibling, 1 reply; 87+ messages in thread
From: Linus Torvalds @ 2024-02-26 23:48 UTC (permalink / raw)
  To: Al Viro
  Cc: Kent Overstreet, Matthew Wilcox, Luis Chamberlain, lsf-pc,
	linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe,
	Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner

On Mon, 26 Feb 2024 at 14:46, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I really haven't tested this AT ALL. I'm much too scared.

"Courage is not the absence of fear, but acting in spite of it"
         - Paddington Bear / Michael Scott

It seems to actually boot here.

That said, from a quick test with lots of threads all hammering on the
same page - I'm still not entirely convinced it makes a difference.
Sure, the kernel profile changes, but filemap_get_read_batch() wasn't
very high up in the profile to begin with.

I didn't do any actual performance testing; I just did a 64-byte pread
at offset 0 in a loop in 64 threads on my 32c/64t machine.
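
(For reference, a minimal userspace sketch of that kind of test - the
file name and iteration count here are arbitrary; compile with -pthread:)

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NR_THREADS	64
#define NR_ITERS	(10 * 1000 * 1000L)

static int fd;

static void *reader(void *arg)
{
	char buf[64];

	/* hammer the same 64 bytes at offset 0 of a cached file */
	for (long i = 0; i < NR_ITERS; i++)
		if (pread(fd, buf, sizeof(buf), 0) != sizeof(buf)) {
			perror("pread");
			exit(1);
		}
	return NULL;
}

int main(void)
{
	pthread_t threads[NR_THREADS];

	fd = open("testfile", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	for (int i = 0; i < NR_THREADS; i++)
		pthread_create(&threads[i], NULL, reader, NULL);
	for (int i = 0; i < NR_THREADS; i++)
		pthread_join(threads[i], NULL);

	return 0;
}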

The cache ping-pong would be a lot more noticeable on some
multi-socket machine, of course, but I do get the feeling that this is
all optimizing for such an edge-case of an edge-case that it's all a
bit questionable.

But that patch does largely seem to work. Famous last words. It really
needs a lot more sanity checking, and that comment about probably
needing a memory barrier is still valid.

And even then there's the question about replacing the same folio in
the same spot in the xarray. I'm not convinced it is worth worrying
about in any reality we care about, but it's _technically_ all a bit
wrong.

So I'm throwing that patch over the fence to somebody that cares. I
_do_ now claim it at least kind of works.

                Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26 23:29                               ` Kent Overstreet
@ 2024-02-27  0:05                                 ` Paul E. McKenney
  2024-02-27  0:29                                   ` Kent Overstreet
  2024-02-27  0:43                                 ` Dave Chinner
  1 sibling, 1 reply; 87+ messages in thread
From: Paul E. McKenney @ 2024-02-27  0:05 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Mon, Feb 26, 2024 at 06:29:43PM -0500, Kent Overstreet wrote:
> On Mon, Feb 26, 2024 at 01:55:10PM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 26, 2024 at 04:19:14PM -0500, Kent Overstreet wrote:
> > > +cc Paul
> > > 
> > > On Mon, Feb 26, 2024 at 04:17:19PM -0500, Kent Overstreet wrote:
> > > > On Mon, Feb 26, 2024 at 09:07:51PM +0000, Matthew Wilcox wrote:
> > > > > On Mon, Feb 26, 2024 at 09:17:33AM -0800, Linus Torvalds wrote:
> > > > > > Willy - tangential side note: I looked closer at the issue that you
> > > > > > reported (indirectly) with the small reads during heavy write
> > > > > > activity.
> > > > > > 
> > > > > > Our _reading_ side is very optimized and has none of the write-side
> > > > > > oddities that I can see, and we just have
> > > > > > 
> > > > > >   filemap_read ->
> > > > > >     filemap_get_pages ->
> > > > > >         filemap_get_read_batch ->
> > > > > >           folio_try_get_rcu()
> > > > > > 
> > > > > > and there is no page locking or other locking involved (assuming the
> > > > > > page is cached and marked uptodate etc, of course).
> > > > > > 
> > > > > > So afaik, it really is just that *one* atomic access (and the matching
> > > > > > page ref decrement afterwards).
> > > > > 
> > > > > Yep, that was what the customer reported on their ancient kernel, and
> > > > > we at least didn't make that worse ...
> > > > > 
> > > > > > We could easily do all of this without getting any ref to the page at
> > > > > > all if we did the page cache release with RCU (and the user copy with
> > > > > > "copy_to_user_atomic()").  Honestly, anything else looks like a
> > > > > > complete disaster. For tiny reads, a temporary buffer sounds ok, but
> > > > > > really *only* for tiny reads where we could have that buffer on the
> > > > > > stack.
> > > > > > 
> > > > > > Are tiny reads (handwaving: 100 bytes or less) really worth optimizing
> > > > > > for to that degree?
> > > > > > 
> > > > > > In contrast, the RCU-delaying of the page cache might be a good idea
> > > > > > in general. We've had other situations where that would have been
> > > > > > nice. The main worry would be low-memory situations, I suspect.
> > > > > > 
> > > > > > The "tiny read" optimization smells like a benchmark thing to me. Even
> > > > > > with the cacheline possibly bouncing, the system call overhead for
> > > > > > tiny reads (particularly with all the mitigations) should be orders of
> > > > > > magnitude higher than two atomic accesses.
> > > > > 
> > > > > Ah, good point about the $%^&^*^ mitigations.  This was pre mitigations.
> > > > > I suspect that this customer would simply disable them; afaik the machine
> > > > > is an appliance and one interacts with it purely by sending transactions
> > > > > to it (it's not even an SQL system, much less a "run arbitrary javascript"
> > > > > kind of system).  But that makes it even more special case, inapplicable
> > > > > to the majority of workloads and closer to smelling like a benchmark.
> > > > > 
> > > > > I've thought about and rejected RCU delaying of the page cache in the
> > > > > past.  With the majority of memory in anon memory & file memory, it just
> > > > > feels too risky to have so much memory waiting to be reused.  We could
> > > > > also improve gup-fast if we could rely on RCU freeing of anon memory.
> > > > > Not sure what workloads might benefit from that, though.
> > > > 
> > > > RCU allocating and freeing of memory can already be fairly significant
> > > > depending on workload, and I'd expect that to grow - we really just need
> > > > a way for reclaim to kick RCU when needed (and probably add a percpu
> > > > counter for "amount of memory stranded until the next RCU grace
> > > > period").
> > 
> > There are some APIs for that, though the are sharp-edged and mainly
> > intended for rcutorture, and there are some hooks for a CI Kconfig
> > option called RCU_STRICT_GRACE_PERIOD that could be organized into
> > something useful.
> > 
> > Of course, if there is a long-running RCU reader, there is nothing
> > RCU can do.  By definition, it must wait on all pre-existing readers,
> > no exceptions.
> > 
> > But my guess is that you instead are thinking of memory-exhaustion
> > emergencies where you would like RCU to burn more CPU than usual to
> > reduce grace-period latency, there are definitely things that can be done.
> > 
> > I am sure that there are more questions that I should ask, but the one
> > that comes immediately to mind is "Is this API call an occasional thing,
> > or does RCU need to tolerate many CPUs hammering it frequently?"
> > Either answer is fine, I just need to know.  ;-)
> 
> Well, we won't want it getting hammered on continuously - we should be
> able to tune reclaim so that doesn't happen.
> 
> I think getting numbers on the amount of memory stranded waiting for RCU
> is probably first order of business - minor tweak to kfree_rcu() et all
> for that; there's APIs they can query to maintain that counter.

We can easily tell you the number of blocks of memory waiting to be freed.
But RCU does not know their size.  Yes, we could ferret this out on each
call to kfree_rcu(), but that might not be great for performance.
We could traverse the lists at runtime, but such traversal must be done
with interrupts disabled, which is also not great.

> then, we can add a heuristic threshhold somewhere, something like 
> 
> if (rcu_stranded * multiplier > reclaimable_memory)
> 	kick_rcu()

If it is a heuristic anyway, it sounds best to base the heuristic on
the number of objects rather than their aggregate size.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27  0:05                                 ` Paul E. McKenney
@ 2024-02-27  0:29                                   ` Kent Overstreet
  2024-02-27  0:55                                     ` Paul E. McKenney
  0 siblings, 1 reply; 87+ messages in thread
From: Kent Overstreet @ 2024-02-27  0:29 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Mon, Feb 26, 2024 at 04:05:37PM -0800, Paul E. McKenney wrote:
> On Mon, Feb 26, 2024 at 06:29:43PM -0500, Kent Overstreet wrote:
> > Well, we won't want it getting hammered on continuously - we should be
> > able to tune reclaim so that doesn't happen.
> > 
> > I think getting numbers on the amount of memory stranded waiting for RCU
> > is probably first order of business - minor tweak to kfree_rcu() et all
> > for that; there's APIs they can query to maintain that counter.
> 
> We can easily tell you the number of blocks of memory waiting to be freed.
> But RCU does not know their size.  Yes, we could ferret this on each
> call to kmem_free_rcu(), but that might not be great for performance.
> We could traverse the lists at runtime, but such traversal must be done
> with interrupts disabled, which is also not great.
> 
> > then, we can add a heuristic threshhold somewhere, something like 
> > 
> > if (rcu_stranded * multiplier > reclaimable_memory)
> > 	kick_rcu()
> 
> If it is a heuristic anyway, it sounds best to base the heuristic on
> the number of objects rather than their aggregate size.

I don't think that'll really work given that object size can vary from
< 100 bytes all the way up to 2MB hugepages. The shrinker API works that
way and I positively hate it; it's really helpful for introspection and
debuggability later to give good human-understandable units to this
stuff.

And __ksize() is pretty cheap, and I think there might be room in struct
slab to stick the object size there instead of getting it from the slab
cache - and folio_size() is cheaper still.
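
(Roughly what per-call accounting might look like - hypothetical, not
something kfree_rcu() does today; __ksize(), folio_size(),
virt_to_folio() and folio_test_slab() are existing helpers, the counter
is the same made-up rcu_stranded_bytes as before:)

static void rcu_account_stranded(const void *ptr)
{
	struct folio *folio = virt_to_folio(ptr);
	size_t bytes;

	if (folio_test_slab(folio))
		bytes = __ksize(ptr);		/* slab object footprint */
	else
		bytes = folio_size(folio);	/* whole page/compound allocation */

	atomic_long_add(bytes, &rcu_stranded_bytes);
}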

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26 23:29                               ` Kent Overstreet
  2024-02-27  0:05                                 ` Paul E. McKenney
@ 2024-02-27  0:43                                 ` Dave Chinner
  1 sibling, 0 replies; 87+ messages in thread
From: Dave Chinner @ 2024-02-27  0:43 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Paul E. McKenney, Matthew Wilcox, Linus Torvalds, Al Viro,
	Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez,
	Pankaj Raghav, Jens Axboe, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Mon, Feb 26, 2024 at 06:29:43PM -0500, Kent Overstreet wrote:
> On Mon, Feb 26, 2024 at 01:55:10PM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 26, 2024 at 04:19:14PM -0500, Kent Overstreet wrote:
> > > > RCU allocating and freeing of memory can already be fairly significant
> > > > depending on workload, and I'd expect that to grow - we really just need
> > > > a way for reclaim to kick RCU when needed (and probably add a percpu
> > > > counter for "amount of memory stranded until the next RCU grace
> > > > period").
> > 
> > There are some APIs for that, though the are sharp-edged and mainly
> > intended for rcutorture, and there are some hooks for a CI Kconfig
> > option called RCU_STRICT_GRACE_PERIOD that could be organized into
> > something useful.
> > 
> > Of course, if there is a long-running RCU reader, there is nothing
> > RCU can do.  By definition, it must wait on all pre-existing readers,
> > no exceptions.
> > 
> > But my guess is that you instead are thinking of memory-exhaustion
> > emergencies where you would like RCU to burn more CPU than usual to
> > reduce grace-period latency, there are definitely things that can be done.
> > 
> > I am sure that there are more questions that I should ask, but the one
> > that comes immediately to mind is "Is this API call an occasional thing,
> > or does RCU need to tolerate many CPUs hammering it frequently?"
> > Either answer is fine, I just need to know.  ;-)
> 
> Well, we won't want it getting hammered on continuously - we should be
> able to tune reclaim so that doesn't happen.

If we are under sustained memory pressure, there will be a
relatively steady state of "stranded memory" - every RCU grace
period will be stranding and freeing roughly the same amount of
memory, because reclaim progress across all caches won't change
significantly from grace period to grace period.

I really haven't seen "stranded memory" from reclaimable slab caches
(like inodes and dentries) ever causing issues with allocation or
causing OOM kills.  Hence I'm not sure that there is any real need
for expediting the freeing of RCU memory in the general case - it's
probably only when we get near OOM (i.e. reclaim priority is
approaching 0) that expediting RCU-freed memory may make any
difference to allocation success...

> I think getting numbers on the amount of memory stranded waiting for RCU
> is probably first order of business - minor tweak to kfree_rcu() et all
> for that; there's APIs they can query to maintain that counter.

Yes, please. Get some numbers that show there is an actual problem
here that needs solving.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27  0:29                                   ` Kent Overstreet
@ 2024-02-27  0:55                                     ` Paul E. McKenney
  2024-02-27  1:08                                       ` Kent Overstreet
  0 siblings, 1 reply; 87+ messages in thread
From: Paul E. McKenney @ 2024-02-27  0:55 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Mon, Feb 26, 2024 at 07:29:04PM -0500, Kent Overstreet wrote:
> On Mon, Feb 26, 2024 at 04:05:37PM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 26, 2024 at 06:29:43PM -0500, Kent Overstreet wrote:
> > > Well, we won't want it getting hammered on continuously - we should be
> > > able to tune reclaim so that doesn't happen.
> > > 
> > > I think getting numbers on the amount of memory stranded waiting for RCU
> > > is probably first order of business - minor tweak to kfree_rcu() et all
> > > for that; there's APIs they can query to maintain that counter.
> > 
> > We can easily tell you the number of blocks of memory waiting to be freed.
> > But RCU does not know their size.  Yes, we could ferret this on each
> > call to kmem_free_rcu(), but that might not be great for performance.
> > We could traverse the lists at runtime, but such traversal must be done
> > with interrupts disabled, which is also not great.
> > 
> > > then, we can add a heuristic threshhold somewhere, something like 
> > > 
> > > if (rcu_stranded * multiplier > reclaimable_memory)
> > > 	kick_rcu()
> > 
> > If it is a heuristic anyway, it sounds best to base the heuristic on
> > the number of objects rather than their aggregate size.
> 
> I don't think that'll really work given that object size can very from <
> 100 bytes all the way up to 2MB hugepages. The shrinker API works that
> way and I positively hate it; it's really helpful for introspection and
> debugability later to give good human understandable units to this
> stuff.

You might well be right, but let's please try it before adding overhead to
kfree_rcu() and friends.  I bet it will prove to be good and sufficient.

> And __ksize() is pretty cheap, and I think there might be room in struct
> slab to stick the object size there instead of getting it from the slab
> cache - and folio_size() is cheaper still.

On __ksize():

 * This should only be used internally to query the true size of allocations.
 * It is not meant to be a way to discover the usable size of an allocation
 * after the fact. Instead, use kmalloc_size_roundup().

Except that kmalloc_size_roundup() doesn't look like it is meant for
this use case.  On __ksize() being used only internally, I would not be
at all averse to kfree_rcu() and friends moving to mm.

The idea is for kfree_rcu() to invoke __ksize() when given slab memory
and folio_size() when given vmalloc() memory?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27  0:55                                     ` Paul E. McKenney
@ 2024-02-27  1:08                                       ` Kent Overstreet
  2024-02-27  5:17                                         ` Paul E. McKenney
  0 siblings, 1 reply; 87+ messages in thread
From: Kent Overstreet @ 2024-02-27  1:08 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Mon, Feb 26, 2024 at 04:55:29PM -0800, Paul E. McKenney wrote:
> On Mon, Feb 26, 2024 at 07:29:04PM -0500, Kent Overstreet wrote:
> > On Mon, Feb 26, 2024 at 04:05:37PM -0800, Paul E. McKenney wrote:
> > > On Mon, Feb 26, 2024 at 06:29:43PM -0500, Kent Overstreet wrote:
> > > > Well, we won't want it getting hammered on continuously - we should be
> > > > able to tune reclaim so that doesn't happen.
> > > > 
> > > > I think getting numbers on the amount of memory stranded waiting for RCU
> > > > is probably first order of business - minor tweak to kfree_rcu() et all
> > > > for that; there's APIs they can query to maintain that counter.
> > > 
> > > We can easily tell you the number of blocks of memory waiting to be freed.
> > > But RCU does not know their size.  Yes, we could ferret this on each
> > > call to kmem_free_rcu(), but that might not be great for performance.
> > > We could traverse the lists at runtime, but such traversal must be done
> > > with interrupts disabled, which is also not great.
> > > 
> > > > then, we can add a heuristic threshhold somewhere, something like 
> > > > 
> > > > if (rcu_stranded * multiplier > reclaimable_memory)
> > > > 	kick_rcu()
> > > 
> > > If it is a heuristic anyway, it sounds best to base the heuristic on
> > > the number of objects rather than their aggregate size.
> > 
> > I don't think that'll really work given that object size can very from <
> > 100 bytes all the way up to 2MB hugepages. The shrinker API works that
> > way and I positively hate it; it's really helpful for introspection and
> > debugability later to give good human understandable units to this
> > stuff.
> 
> You might well be right, but let's please try it before adding overhead to
> kfree_rcu() and friends.  I bet it will prove to be good and sufficient.
> 
> > And __ksize() is pretty cheap, and I think there might be room in struct
> > slab to stick the object size there instead of getting it from the slab
> > cache - and folio_size() is cheaper still.
> 
> On __ksize():
> 
>  * This should only be used internally to query the true size of allocations.
>  * It is not meant to be a way to discover the usable size of an allocation
>  * after the fact. Instead, use kmalloc_size_roundup().
> 
> Except that kmalloc_size_roundup() doesn't look like it is meant for
> this use case.  On __ksize() being used only internally, I would not be
> at all averse to kfree_rcu() and friends moving to mm.

__ksize() is the right helper to use for this; ksize() is "how much
usable memory", __ksize() is "how much does this occupy".

> The idea is for kfree_rcu() to invoke __ksize() when given slab memory
> and folio_size() when given vmalloc() memory?

__ksize() for slab memory, but folio_size() would be for page
allocations - actually, I think compound_order() is more appropriate
here, but that's willy's area. IOW, for free_pages_rcu(), which AFAIK we
don't have yet but it looks like we're going to need.

I'm scanning through vmalloc.c and I don't think we have a helper yet to
query the allocation size - I can write one tomorrow, giving my brain a
rest today :)
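
(Presumably something along these lines - a sketch only; find_vm_area()
exists, the helper itself does not:)

/* Hypothetical: how much memory does a vmalloc() allocation occupy? */
static size_t vmalloc_alloc_size(const void *addr)
{
	struct vm_struct *area = find_vm_area(addr);

	/* area->size includes the guard page, if that matters here */
	return area ? area->size : 0;
}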

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27  1:08                                       ` Kent Overstreet
@ 2024-02-27  5:17                                         ` Paul E. McKenney
  2024-02-27  6:21                                           ` Kent Overstreet
  0 siblings, 1 reply; 87+ messages in thread
From: Paul E. McKenney @ 2024-02-27  5:17 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Mon, Feb 26, 2024 at 08:08:17PM -0500, Kent Overstreet wrote:
> On Mon, Feb 26, 2024 at 04:55:29PM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 26, 2024 at 07:29:04PM -0500, Kent Overstreet wrote:
> > > On Mon, Feb 26, 2024 at 04:05:37PM -0800, Paul E. McKenney wrote:
> > > > On Mon, Feb 26, 2024 at 06:29:43PM -0500, Kent Overstreet wrote:
> > > > > Well, we won't want it getting hammered on continuously - we should be
> > > > > able to tune reclaim so that doesn't happen.
> > > > > 
> > > > > I think getting numbers on the amount of memory stranded waiting for RCU
> > > > > is probably first order of business - minor tweak to kfree_rcu() et all
> > > > > for that; there's APIs they can query to maintain that counter.
> > > > 
> > > > We can easily tell you the number of blocks of memory waiting to be freed.
> > > > But RCU does not know their size.  Yes, we could ferret this on each
> > > > call to kmem_free_rcu(), but that might not be great for performance.
> > > > We could traverse the lists at runtime, but such traversal must be done
> > > > with interrupts disabled, which is also not great.
> > > > 
> > > > > then, we can add a heuristic threshhold somewhere, something like 
> > > > > 
> > > > > if (rcu_stranded * multiplier > reclaimable_memory)
> > > > > 	kick_rcu()
> > > > 
> > > > If it is a heuristic anyway, it sounds best to base the heuristic on
> > > > the number of objects rather than their aggregate size.
> > > 
> > > I don't think that'll really work given that object size can very from <
> > > 100 bytes all the way up to 2MB hugepages. The shrinker API works that
> > > way and I positively hate it; it's really helpful for introspection and
> > > debugability later to give good human understandable units to this
> > > stuff.
> > 
> > You might well be right, but let's please try it before adding overhead to
> > kfree_rcu() and friends.  I bet it will prove to be good and sufficient.
> > 
> > > And __ksize() is pretty cheap, and I think there might be room in struct
> > > slab to stick the object size there instead of getting it from the slab
> > > cache - and folio_size() is cheaper still.
> > 
> > On __ksize():
> > 
> >  * This should only be used internally to query the true size of allocations.
> >  * It is not meant to be a way to discover the usable size of an allocation
> >  * after the fact. Instead, use kmalloc_size_roundup().
> > 
> > Except that kmalloc_size_roundup() doesn't look like it is meant for
> > this use case.  On __ksize() being used only internally, I would not be
> > at all averse to kfree_rcu() and friends moving to mm.
> 
> __ksize() is the right helper to use for this; ksize() is "how much
> usable memory", __ksize() is "how much does this occupy".
> 
> > The idea is for kfree_rcu() to invoke __ksize() when given slab memory
> > and folio_size() when given vmalloc() memory?
> 
> __ksize() for slab memory, but folio_size() would be for page
> allocations - actually, I think compound_order() is more appropriate
> here, but that's willy's area. IOW, for free_pages_rcu(), which AFAIK we
> don't have yet but it looks like we're going to need.
> 
> I'm scanning through vmalloc.c and I don't think we have a helper yet to
> query the allocation size - I can write one tomorrow, giving my brain a
> rest today :)

Again, let's give the straight count of blocks a try first.  I do see
that you feel that the added overhead is negligible, but zero added
overhead is even better.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27  5:17                                         ` Paul E. McKenney
@ 2024-02-27  6:21                                           ` Kent Overstreet
  2024-02-27 15:32                                             ` Paul E. McKenney
  0 siblings, 1 reply; 87+ messages in thread
From: Kent Overstreet @ 2024-02-27  6:21 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Mon, Feb 26, 2024 at 09:17:41PM -0800, Paul E. McKenney wrote:
> On Mon, Feb 26, 2024 at 08:08:17PM -0500, Kent Overstreet wrote:
> > On Mon, Feb 26, 2024 at 04:55:29PM -0800, Paul E. McKenney wrote:
> > > On Mon, Feb 26, 2024 at 07:29:04PM -0500, Kent Overstreet wrote:
> > > > On Mon, Feb 26, 2024 at 04:05:37PM -0800, Paul E. McKenney wrote:
> > > > > On Mon, Feb 26, 2024 at 06:29:43PM -0500, Kent Overstreet wrote:
> > > > > > Well, we won't want it getting hammered on continuously - we should be
> > > > > > able to tune reclaim so that doesn't happen.
> > > > > > 
> > > > > > I think getting numbers on the amount of memory stranded waiting for RCU
> > > > > > is probably first order of business - minor tweak to kfree_rcu() et all
> > > > > > for that; there's APIs they can query to maintain that counter.
> > > > > 
> > > > > We can easily tell you the number of blocks of memory waiting to be freed.
> > > > > But RCU does not know their size.  Yes, we could ferret this on each
> > > > > call to kmem_free_rcu(), but that might not be great for performance.
> > > > > We could traverse the lists at runtime, but such traversal must be done
> > > > > with interrupts disabled, which is also not great.
> > > > > 
> > > > > > then, we can add a heuristic threshhold somewhere, something like 
> > > > > > 
> > > > > > if (rcu_stranded * multiplier > reclaimable_memory)
> > > > > > 	kick_rcu()
> > > > > 
> > > > > If it is a heuristic anyway, it sounds best to base the heuristic on
> > > > > the number of objects rather than their aggregate size.
> > > > 
> > > > I don't think that'll really work given that object size can very from <
> > > > 100 bytes all the way up to 2MB hugepages. The shrinker API works that
> > > > way and I positively hate it; it's really helpful for introspection and
> > > > debugability later to give good human understandable units to this
> > > > stuff.
> > > 
> > > You might well be right, but let's please try it before adding overhead to
> > > kfree_rcu() and friends.  I bet it will prove to be good and sufficient.
> > > 
> > > > And __ksize() is pretty cheap, and I think there might be room in struct
> > > > slab to stick the object size there instead of getting it from the slab
> > > > cache - and folio_size() is cheaper still.
> > > 
> > > On __ksize():
> > > 
> > >  * This should only be used internally to query the true size of allocations.
> > >  * It is not meant to be a way to discover the usable size of an allocation
> > >  * after the fact. Instead, use kmalloc_size_roundup().
> > > 
> > > Except that kmalloc_size_roundup() doesn't look like it is meant for
> > > this use case.  On __ksize() being used only internally, I would not be
> > > at all averse to kfree_rcu() and friends moving to mm.
> > 
> > __ksize() is the right helper to use for this; ksize() is "how much
> > usable memory", __ksize() is "how much does this occupy".
> > 
> > > The idea is for kfree_rcu() to invoke __ksize() when given slab memory
> > > and folio_size() when given vmalloc() memory?
> > 
> > __ksize() for slab memory, but folio_size() would be for page
> > allocations - actually, I think compound_order() is more appropriate
> > here, but that's willy's area. IOW, for free_pages_rcu(), which AFAIK we
> > don't have yet but it looks like we're going to need.
> > 
> > I'm scanning through vmalloc.c and I don't think we have a helper yet to
> > query the allocation size - I can write one tomorrow, giving my brain a
> > rest today :)
> 
> Again, let's give the straight count of blocks a try first.  I do see
> that you feel that the added overhead is negligible, but zero added
> overhead is even better.

How are you going to write a heuristic that works correctly both when
the system is cycling through nothing but 2M hugepages, and when it's
cycling through nothing but 128-byte whatevers?

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26 23:48                         ` Linus Torvalds
@ 2024-02-27  7:21                           ` Kent Overstreet
  2024-02-27 15:39                             ` Matthew Wilcox
  2024-02-27 16:34                             ` Linus Torvalds
  0 siblings, 2 replies; 87+ messages in thread
From: Kent Overstreet @ 2024-02-27  7:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Mon, Feb 26, 2024 at 03:48:35PM -0800, Linus Torvalds wrote:
> On Mon, 26 Feb 2024 at 14:46, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I really haven't tested this AT ALL. I'm much too scared.
> 
> "Courage is not the absence of fear, but acting in spite of it"
>          - Paddington Bear / Michal Scott
> 
> It seems to actually boot here.
> 
> That said, from a quick test with lots of threads all hammering on the
> same page - I'm still not entirely convinced it makes a difference.
> Sure, the kernel profile changes, but filemap_get_read_batch() wasn't
> very high up in the profile to begin with.
> 
> I didn't do any actual performance testing, I just did a 64-byte pread
> at offset 0 in a loop in 64 threads on my 32c/64t machine.

Only rough testing, but this is looking like around a 25% performance
increase doing 4k random reads on a 1G file with fio, 8 jobs, on my
Ryzen 5950x - 16.7M -> 21.4M iops, very roughly. fio's a pig and we're
only spending half our CPU time in the kernel, so the buffered read path
is actually getting 40% or 50% faster.

So I'd say that's substantial.

RCU freeing of pagecache pages would be even better - I think that'd let
us completely get rid of the barrier & xarray recheck, and we wouldn't
have to do it as a silly special case.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-23 23:59 [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO Luis Chamberlain
                   ` (2 preceding siblings ...)
  2024-02-26 12:22 ` Dave Chinner
@ 2024-02-27 10:07 ` Kent Overstreet
  2024-02-27 14:08   ` Luis Chamberlain
  2024-02-27 22:13   ` Dave Chinner
  3 siblings, 2 replies; 87+ messages in thread
From: Kent Overstreet @ 2024-02-27 10:07 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner, Matthew Wilcox, Linus Torvalds

On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> Part of the testing we have done with LBS was to do some performance
> tests on XFS to ensure things are not regressing. Building linux is a
> fine decent test and we did some random cloud instance tests on that and
> presented that at Plumbers, but it doesn't really cut it if we want to
> push things to the limit though. What are the limits to buffered IO
> and how do we test that? Who keeps track of it?
> 
> The obvious recurring tension is that for really high performance folks
> just recommend to use birect IO. But if you are stress testing changes
> to a filesystem and want to push buffered IO to its limits it makes
> sense to stick to buffered IO, otherwise how else do we test it?
> 
> It is good to know limits to buffered IO too because some workloads
> cannot use direct IO.  For instance PostgreSQL doesn't have direct IO
> support and even as late as the end of last year we learned that adding
> direct IO to PostgreSQL would be difficult.  Chris Mason has noted also
> that direct IO can also force writes during reads (?)... Anyway, testing
> the limits of buffered IO limits to ensure you are not creating
> regressions when doing some page cache surgery seems like it might be
> useful and a sensible thing to do .... The good news is we have not found
> regressions with LBS but all the testing seems to beg the question, of what
> are the limits of buffered IO anyway, and how does it scale? Do we know, do
> we care? Do we keep track of it? How does it compare to direct IO for some
> workloads? How big is the delta? How do we best test that? How do we
> automate all that? Do we want to automatically test this to avoid regressions?
> 
> The obvious issues with some workloads for buffered IO is having a
> possible penality if you are not really re-using folios added to the
> page cache. Jens Axboe reported a while ago issues with workloads with
> random reads over a data set 10x the size of RAM and also proposed
> RWF_UNCACHED as a way to help [0]. As Chinner put it, this seemed more
> like direct IO with kernel pages and a memcpy(), and it requires
> further serialization to be implemented that we already do for
> direct IO for writes. There at least seems to be agreement that if we're
> going to provide an enhancement or alternative that we should strive to not
> make the same mistakes we've done with direct IO. The rationale for some
> workloads to use buffered IO is it helps reduce some tail latencies, so
> that's something to live up to.
> 
> On that same thread Christoph also mentioned the possibility of a direct
> IO variant which can leverage the cache. Is that something we want to
> move forward with?
> 
> Chris Mason also listed a few other desirables if we do:
> 
> - Allowing concurrent writes (xfs DIO does this now)

AFAIK every filesystem allows concurrent direct writes, not just xfs;
it's _buffered_ writes that we care about here.

I just pushed a patch to my CI for buffered writes without taking the
inode lock - for bcachefs. It'll be straightforward, but a decent amount
of work, to lift this to the VFS, if people are interested in
collaborating.

https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-buffered-write-locking

The approach is: for non extending, non appending writes, see if we can
pin the entire range of the pagecache we're writing to; fall back to
taking the inode lock if we can't.

If we do a short write because of a page fault (despite previously
faulting in the userspace buffer), there is no way to completely prevent
torn writes and atomicity breakage; we could at least try a trylock on
the inode lock, but I didn't do that here.

For lifting this to the VFS, this needs
 - My darray code, which I'll be moving to include/linux/ in the 6.9
   merge window
 - My pagecache add lock - we need this for synchronization with hole
   punching and truncate when we don't have the inode lock.
 - My vectorized buffered write path lifted to filemap.c, which means we
   need some sort of vectorized replacement for .write_begin and
   .write_end
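
(The shape of the fast path is very roughly the below - a sketch of the
idea, not the actual bcachefs code; pagecache_range_pin()/unpin() and
write_to_pinned_folios() are made-up names for the machinery described
above:)

ssize_t buffered_write(struct kiocb *iocb, struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	ssize_t ret;

	/* Only non-appending, non-extending writes can skip the inode lock */
	if (!(iocb->ki_flags & IOCB_APPEND) &&
	    iocb->ki_pos + iov_iter_count(from) <= i_size_read(inode) &&
	    pagecache_range_pin(inode->i_mapping, iocb->ki_pos,
				iov_iter_count(from))) {
		ret = write_to_pinned_folios(iocb, from);	/* copy + dirty */
		pagecache_range_unpin(inode->i_mapping, iocb->ki_pos,
				      iov_iter_count(from));
		return ret;
	}

	/* Couldn't pin the range (or the write extends the file): lock */
	inode_lock(inode);
	ret = generic_perform_write(iocb, from);
	inode_unlock(inode);
	return ret;
}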

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27 10:07 ` Kent Overstreet
@ 2024-02-27 14:08   ` Luis Chamberlain
  2024-02-27 14:57     ` Kent Overstreet
  2024-02-27 22:13   ` Dave Chinner
  1 sibling, 1 reply; 87+ messages in thread
From: Luis Chamberlain @ 2024-02-27 14:08 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner, Matthew Wilcox, Linus Torvalds

On Tue, Feb 27, 2024 at 05:07:30AM -0500, Kent Overstreet wrote:
> On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> > Part of the testing we have done with LBS was to do some performance
> > tests on XFS to ensure things are not regressing. Building linux is a
> > fine decent test and we did some random cloud instance tests on that and
> > presented that at Plumbers, but it doesn't really cut it if we want to
> > push things to the limit though. What are the limits to buffered IO
> > and how do we test that? Who keeps track of it?
> > 
> > The obvious recurring tension is that for really high performance folks
> > just recommend to use birect IO. But if you are stress testing changes
> > to a filesystem and want to push buffered IO to its limits it makes
> > sense to stick to buffered IO, otherwise how else do we test it?
> > 
> > It is good to know limits to buffered IO too because some workloads
> > cannot use direct IO.  For instance PostgreSQL doesn't have direct IO
> > support and even as late as the end of last year we learned that adding
> > direct IO to PostgreSQL would be difficult.  Chris Mason has noted also
> > that direct IO can also force writes during reads (?)... Anyway, testing
> > the limits of buffered IO limits to ensure you are not creating
> > regressions when doing some page cache surgery seems like it might be
> > useful and a sensible thing to do .... The good news is we have not found
> > regressions with LBS but all the testing seems to beg the question, of what
> > are the limits of buffered IO anyway, and how does it scale? Do we know, do
> > we care? Do we keep track of it? How does it compare to direct IO for some
> > workloads? How big is the delta? How do we best test that? How do we
> > automate all that? Do we want to automatically test this to avoid regressions?
> > 
> > The obvious issues with some workloads for buffered IO is having a
> > possible penality if you are not really re-using folios added to the
> > page cache. Jens Axboe reported a while ago issues with workloads with
> > random reads over a data set 10x the size of RAM and also proposed
> > RWF_UNCACHED as a way to help [0]. As Chinner put it, this seemed more
> > like direct IO with kernel pages and a memcpy(), and it requires
> > further serialization to be implemented that we already do for
> > direct IO for writes. There at least seems to be agreement that if we're
> > going to provide an enhancement or alternative that we should strive to not
> > make the same mistakes we've done with direct IO. The rationale for some
> > workloads to use buffered IO is it helps reduce some tail latencies, so
> > that's something to live up to.
> > 
> > On that same thread Christoph also mentioned the possibility of a direct
> > IO variant which can leverage the cache. Is that something we want to
> > move forward with?
> > 
> > Chris Mason also listed a few other desirables if we do:
> > 
> > - Allowing concurrent writes (xfs DIO does this now)
> 
> AFAIK every filesystem allows concurrent direct writes, not just xfs,
> it's _buffered_ writes that we care about here.

The context above was about a possible direct IO variant; that's why direct IO
was mentioned, and why I noted that XFS at least has support.

> I just pushed a patch to my CI for buffered writes without taking the
> inode lock - for bcachefs. It'll be straightforward, but a decent amount
> of work, to lift this to the VFS, if people are interested in
> collaborating.
> 
> https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-buffered-write-locking

Neat, this is sort of what I wanted to get a sense for, if this sort of
topic was worth discussing at LSFMM.

> The approach is: for non extending, non appending writes, see if we can
> pin the entire range of the pagecache we're writing to; fall back to
> taking the inode lock if we can't.

Perhaps a silly thought... but my initial reaction is: would it make
sense for the page cache itself to make this easier for us? It is not
clear to me, but my first reaction to seeing some of these deltas was,
what if we had the space split up into groups, as we do with XFS
agcounts, so that each group deals with its own ranges? I considered
this before profiling, and as with Matthew I figured it might be lock
contention.  It very likely is not for my test case, and as Linus and
Dave have clarified we are both penalized and also have single-threaded
writeback.  If we had a group split we'd have locks per group and
perhaps a dedicated writeback thread per group.

  Luis

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27 14:08   ` Luis Chamberlain
@ 2024-02-27 14:57     ` Kent Overstreet
  0 siblings, 0 replies; 87+ messages in thread
From: Kent Overstreet @ 2024-02-27 14:57 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner, Matthew Wilcox, Linus Torvalds

On Tue, Feb 27, 2024 at 06:08:57AM -0800, Luis Chamberlain wrote:
> On Tue, Feb 27, 2024 at 05:07:30AM -0500, Kent Overstreet wrote:
> > On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> > > Part of the testing we have done with LBS was to do some performance
> > > tests on XFS to ensure things are not regressing. Building linux is a
> > > fine decent test and we did some random cloud instance tests on that and
> > > presented that at Plumbers, but it doesn't really cut it if we want to
> > > push things to the limit though. What are the limits to buffered IO
> > > and how do we test that? Who keeps track of it?
> > > 
> > > The obvious recurring tension is that for really high performance folks
> > > just recommend to use birect IO. But if you are stress testing changes
> > > to a filesystem and want to push buffered IO to its limits it makes
> > > sense to stick to buffered IO, otherwise how else do we test it?
> > > 
> > > It is good to know limits to buffered IO too because some workloads
> > > cannot use direct IO.  For instance PostgreSQL doesn't have direct IO
> > > support and even as late as the end of last year we learned that adding
> > > direct IO to PostgreSQL would be difficult.  Chris Mason has noted also
> > > that direct IO can also force writes during reads (?)... Anyway, testing
> > > the limits of buffered IO limits to ensure you are not creating
> > > regressions when doing some page cache surgery seems like it might be
> > > useful and a sensible thing to do .... The good news is we have not found
> > > regressions with LBS but all the testing seems to beg the question, of what
> > > are the limits of buffered IO anyway, and how does it scale? Do we know, do
> > > we care? Do we keep track of it? How does it compare to direct IO for some
> > > workloads? How big is the delta? How do we best test that? How do we
> > > automate all that? Do we want to automatically test this to avoid regressions?
> > > 
> > > The obvious issues with some workloads for buffered IO is having a
> > > possible penality if you are not really re-using folios added to the
> > > page cache. Jens Axboe reported a while ago issues with workloads with
> > > random reads over a data set 10x the size of RAM and also proposed
> > > RWF_UNCACHED as a way to help [0]. As Chinner put it, this seemed more
> > > like direct IO with kernel pages and a memcpy(), and it requires
> > > further serialization to be implemented that we already do for
> > > direct IO for writes. There at least seems to be agreement that if we're
> > > going to provide an enhancement or alternative that we should strive to not
> > > make the same mistakes we've done with direct IO. The rationale for some
> > > workloads to use buffered IO is it helps reduce some tail latencies, so
> > > that's something to live up to.
> > > 
> > > On that same thread Christoph also mentioned the possibility of a direct
> > > IO variant which can leverage the cache. Is that something we want to
> > > move forward with?
> > > 
> > > Chris Mason also listed a few other desirables if we do:
> > > 
> > > - Allowing concurrent writes (xfs DIO does this now)
> > 
> > AFAIK every filesystem allows concurrent direct writes, not just xfs,
> > it's _buffered_ writes that we care about here.
> 
> The context above was a possible direct IO variant, that's why direct IO
> was mentioned and that XFS at least had support.
> 
> > I just pushed a patch to my CI for buffered writes without taking the
> > inode lock - for bcachefs. It'll be straightforward, but a decent amount
> > of work, to lift this to the VFS, if people are interested in
> > collaborating.
> > 
> > https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-buffered-write-locking
> 
> Neat, this is sort of what I wanted to get a sense for, if this sort of
> topic was worth discussing at LSFMM.
> 
> > The approach is: for non extending, non appending writes, see if we can
> > pin the entire range of the pagecache we're writing to; fall back to
> > taking the inode lock if we can't.
> 
> Perhaps a silly thought... but initial reaction is, would it make sense
> for the page cache to make this easier for us, so we have this be
> easier? It is not clear to me but my first reaction to seeing some of
> these deltas was what if we had something like the space split up, as we
> do with XFS agcounts, and so each group deals with its own ranges. I
> considered this before profiling, and as with Matthew I figured it might
> be lock contenton.  It very likely is not for my test case, and as Linus
> and Dave has clarified we are both penalized and also have a
> singlthreaded writeback.  If we had a group split we'd have locks per
> group and perhaps a writeback a dedicated thread per group.

Wtf are you talking about?

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27  6:21                                           ` Kent Overstreet
@ 2024-02-27 15:32                                             ` Paul E. McKenney
  2024-02-27 15:52                                               ` Kent Overstreet
  2024-02-27 15:54                                               ` Matthew Wilcox
  0 siblings, 2 replies; 87+ messages in thread
From: Paul E. McKenney @ 2024-02-27 15:32 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Tue, Feb 27, 2024 at 01:21:05AM -0500, Kent Overstreet wrote:
> On Mon, Feb 26, 2024 at 09:17:41PM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 26, 2024 at 08:08:17PM -0500, Kent Overstreet wrote:
> > > On Mon, Feb 26, 2024 at 04:55:29PM -0800, Paul E. McKenney wrote:
> > > > On Mon, Feb 26, 2024 at 07:29:04PM -0500, Kent Overstreet wrote:
> > > > > On Mon, Feb 26, 2024 at 04:05:37PM -0800, Paul E. McKenney wrote:
> > > > > > On Mon, Feb 26, 2024 at 06:29:43PM -0500, Kent Overstreet wrote:
> > > > > > > Well, we won't want it getting hammered on continuously - we should be
> > > > > > > able to tune reclaim so that doesn't happen.
> > > > > > > 
> > > > > > > I think getting numbers on the amount of memory stranded waiting for RCU
> > > > > > > is probably first order of business - minor tweak to kfree_rcu() et all
> > > > > > > for that; there's APIs they can query to maintain that counter.
> > > > > > 
> > > > > > We can easily tell you the number of blocks of memory waiting to be freed.
> > > > > > But RCU does not know their size.  Yes, we could ferret this on each
> > > > > > call to kmem_free_rcu(), but that might not be great for performance.
> > > > > > We could traverse the lists at runtime, but such traversal must be done
> > > > > > with interrupts disabled, which is also not great.
> > > > > > 
> > > > > > > then, we can add a heuristic threshhold somewhere, something like 
> > > > > > > 
> > > > > > > if (rcu_stranded * multiplier > reclaimable_memory)
> > > > > > > 	kick_rcu()
> > > > > > 
> > > > > > If it is a heuristic anyway, it sounds best to base the heuristic on
> > > > > > the number of objects rather than their aggregate size.
> > > > > 
> > > > > I don't think that'll really work given that object size can very from <
> > > > > 100 bytes all the way up to 2MB hugepages. The shrinker API works that
> > > > > way and I positively hate it; it's really helpful for introspection and
> > > > > debugability later to give good human understandable units to this
> > > > > stuff.
> > > > 
> > > > You might well be right, but let's please try it before adding overhead to
> > > > kfree_rcu() and friends.  I bet it will prove to be good and sufficient.
> > > > 
> > > > > And __ksize() is pretty cheap, and I think there might be room in struct
> > > > > slab to stick the object size there instead of getting it from the slab
> > > > > cache - and folio_size() is cheaper still.
> > > > 
> > > > On __ksize():
> > > > 
> > > >  * This should only be used internally to query the true size of allocations.
> > > >  * It is not meant to be a way to discover the usable size of an allocation
> > > >  * after the fact. Instead, use kmalloc_size_roundup().
> > > > 
> > > > Except that kmalloc_size_roundup() doesn't look like it is meant for
> > > > this use case.  On __ksize() being used only internally, I would not be
> > > > at all averse to kfree_rcu() and friends moving to mm.
> > > 
> > > __ksize() is the right helper to use for this; ksize() is "how much
> > > usable memory", __ksize() is "how much does this occupy".
> > > 
> > > > The idea is for kfree_rcu() to invoke __ksize() when given slab memory
> > > > and folio_size() when given vmalloc() memory?
> > > 
> > > __ksize() for slab memory, but folio_size() would be for page
> > > allocations - actually, I think compound_order() is more appropriate
> > > here, but that's willy's area. IOW, for free_pages_rcu(), which AFAIK we
> > > don't have yet but it looks like we're going to need.
> > > 
> > > I'm scanning through vmalloc.c and I don't think we have a helper yet to
> > > query the allocation size - I can write one tomorrow, giving my brain a
> > > rest today :)
> > 
> > Again, let's give the straight count of blocks a try first.  I do see
> > that you feel that the added overhead is negligible, but zero added
> > overhead is even better.
> 
> How are you going to write a heuristic that works correctly both when
> the system is cycling through nothing but 2M hugepages, and nothing but
> 128 byte whatevers?

I could simply use the same general approach that I use within RCU
itself, which currently has absolutely no idea how much memory (if any)
each callback will free.  Especially given that some callbacks
free groups of memory blocks, while others free nothing.  ;-)

Alternatively, we could gather statistics on the amount of memory freed
by each callback and use that as an estimate.

But we should instead step back and ask exactly what we are trying to
accomplish here, which just might be what Dave Chinner was getting at.

At a ridiculously high level, reclaim is looking for memory to free.
Some read-only memory can often be dropped immediately on the grounds
that its data can be read back in if needed.  Other memory can only be
dropped after being written out, which involves a delay.  There are of
course many other complications, but this will do for a start.

So, where does RCU fit in?

RCU fits in between the two.  With memory awaiting RCU, there is no need
to write anything out, but there is a delay.  As such, memory waiting
for an RCU grace period is similar to memory that is to be reclaimed
after its I/O completes.

One complication, and a complication that we are considering exploiting,
is that, unlike reclaimable memory waiting for I/O, we could often
(but not always) have some control over how quickly RCU's grace periods
complete.  And we already do this programmatically by using the choice
between synchronize_rcu() and synchronize_rcu_expedited().  The question
is whether we should expedite normal RCU grace periods during reclaim,
and if so, under what conditions.

You identified one potential condition, namely the amount of memory
waiting to be reclaimed.  One complication with this approach is that RCU
has no idea how much memory each callback represents, and for call_rcu(),
there is no way for it to find out.  For kfree_rcu(), there are ways,
but as you know, I am questioning whether those ways are reasonable from
a performance perspective.  But even if they are, we would be accepting
more error from the memory waiting via call_rcu() than we would be
accepting if we just counted blocks instead of bytes for kfree_rcu().

Let me reiterate that:  The estimation error that you are objecting to
for kfree_rcu() is completely and utterly unavoidable for call_rcu().
RCU callback functions do whatever their authors want, and we won't be
analyzing their code to estimate bytes freed without some serious advances
in code-analysis technology.  Hence my perhaps otherwise inexplicable
insistence on starting with block counts rather than byte counts.

Another complication surrounding estimating memory to be freed is that
this memory can be in any of the following states:

1.	Not yet associated with a specific grace period.  Its CPU (or
	its rcuog kthread, as the case may be) must interact with the
	RCU core and assign a grace period.  There are a few costly
	RCU_STRICT_GRACE_PERIOD tricks that can help here, usually
	involving IPIing a bunch of CPUs or awakening a bunch of rcuog
	kthreads.

2.	Associated with a grace period that has not yet started.
	This grace period must of course be started, which usually
	cannot happen until the previous grace period completes.
	Which leads us to...

3.	Associated with the current grace period.  This is where
	the rcutorture forcing of quiescent states comes in.

4.	Waiting to be invoked.	This happens from either RCU_SOFTIRQ
	context or from rcuoc kthread context, and is of course impeded
	by any of the aforementioned measures to speed things up.
	Perhaps we could crank up the priority of the relevant ksoftirq
	or rcuog/rcuoc kthreads, though this likely has some serious
	side effects.  Besides, as far as I know, we don't mess with
	other process priorities for the benefit of reclaim, so why
	would we start with RCU?

Of these, #3 and #4 are normally the slowest, with #3 often being the
slowest under light call_rcu()/kfree_rcu() load (or in the presence of
slow RCU readers) and #4 often being the slowest under callback-flooding
conditions.  Now reclaim might cause callback flooding, but I don't
recall seeing this.  At least not to the same extent as userspace-induced
callback flooding, for example, during "rm -rf" of a big filesystem tree
with lots of small files on a fast device.  So we should likely focus on #3.

So we can speed RCU up, and we could use some statistics on the quantity
of whatever is waiting for RCU.  But is that the right approach?

Keep in mind that reclaim via writeback involves I/O delays.  Now these
delays are often much shorter than they used to be, courtesy of SSDs.
But queueing is still a thing, as are limitations on write bandwidth,
so those delays are still consequential.  My experience is that reclaim
spans seconds rather than milliseconds, let alone microseconds, and RCU
grace periods are in the tens to hundreds of milliseconds.  Or am I yet
again showing my age?

So perhaps we should instead make RCU take "ongoing reclaim" into account
when prodding reluctant grace periods.  For example, RCU normally scans
for idle CPUs every 3 milliseconds.  Maybe it should speed that up to
once per millisecond when reclaim is ongoing.  And likewise for RCU's
other grace-period-prodding heuristics.

In addition, RCU uses the number of callbacks queued on a particular
CPU to instigate grace-period prodding.  Maybe these heuristics should
also depend on whether or not a reclaim is in progress.

Is there an easy, fast, and reliable way for RCU to determine whether or
not a reclaim is ongoing?  It might be both easier and more effective
for RCU to simply unconditionally react to the existence of a reclaim
than for the reclaim process to try to figure out when to prod RCU.
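
As an aside, a minimal sketch of what such a reclaim notification could look
like is below.  None of these hooks exist today; the function names, the
counter, and the 3ms/1ms scan intervals are assumptions for illustration only.

#include <linux/atomic.h>
#include <linux/jiffies.h>

/*
 * Hypothetical sketch: reclaim tells RCU when it starts and stops, and
 * RCU's grace-period kthread shortens its idle-CPU scan interval while
 * any reclaim is running.  Illustrative only, not an existing API.
 */
static atomic_t rcu_reclaims_in_progress = ATOMIC_INIT(0);

void rcu_note_reclaim_begin(void)       /* called on reclaim entry */
{
        atomic_inc(&rcu_reclaims_in_progress);
}

void rcu_note_reclaim_end(void)         /* called on reclaim exit */
{
        atomic_dec(&rcu_reclaims_in_progress);
}

static unsigned long rcu_fqs_scan_interval(void)
{
        /* Scan for idle CPUs every 3ms normally, every 1ms under reclaim. */
        int msecs = atomic_read(&rcu_reclaims_in_progress) ? 1 : 3;

        return msecs_to_jiffies(msecs);
}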

							Thanx, Paul

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27  7:21                           ` Kent Overstreet
@ 2024-02-27 15:39                             ` Matthew Wilcox
  2024-02-27 15:54                               ` Kent Overstreet
  2024-02-27 16:34                             ` Linus Torvalds
  1 sibling, 1 reply; 87+ messages in thread
From: Matthew Wilcox @ 2024-02-27 15:39 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Linus Torvalds, Al Viro, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Tue, Feb 27, 2024 at 02:21:59AM -0500, Kent Overstreet wrote:
> On Mon, Feb 26, 2024 at 03:48:35PM -0800, Linus Torvalds wrote:
> > On Mon, 26 Feb 2024 at 14:46, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > I really haven't tested this AT ALL. I'm much too scared.
> > 
> > "Courage is not the absence of fear, but acting in spite of it"
> >          - Paddington Bear / Michal Scott
> > 
> > It seems to actually boot here.
> > 
> > That said, from a quick test with lots of threads all hammering on the
> > same page - I'm still not entirely convinced it makes a difference.
> > Sure, the kernel profile changes, but filemap_get_read_batch() wasn't
> > very high up in the profile to begin with.
> > 
> > I didn't do any actual performance testing, I just did a 64-byte pread
> > at offset 0 in a loop in 64 threads on my 32c/64t machine.
> 
> Only rough testing, but  this is looking like around a 25% performance
> increase doing 4k random reads on a 1G file with fio, 8 jobs, on my
> Ryzen 5950x - 16.7M -> 21.4M iops, very roughly. fio's a pig and we're
> only spending half our cpu time in the kernel, so the buffered read path
> is actually getting 40% or 50% faster.

Linus' patch only kicks in for 128 bytes or smaller.  So what are you
measuring?

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27 15:32                                             ` Paul E. McKenney
@ 2024-02-27 15:52                                               ` Kent Overstreet
  2024-02-27 16:06                                                 ` Paul E. McKenney
  2024-02-27 15:54                                               ` Matthew Wilcox
  1 sibling, 1 reply; 87+ messages in thread
From: Kent Overstreet @ 2024-02-27 15:52 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Tue, Feb 27, 2024 at 07:32:32AM -0800, Paul E. McKenney wrote:
> I could simply use the same general approach that I use within RCU
> itself, which currently has absolutely no idea how much memory (if any)
> that each callback will free.  Especially given that some callbacks
> free groups of memory blocks, while other free nothing.  ;-)
> 
> Alternatively, we could gather statistics on the amount of memory freed
> by each callback and use that as an estimate.
> 
> But we should instead step back and ask exactly what we are trying to
> accomplish here, which just might be what Dave Chinner was getting at.
> 
> At a ridiculously high level, reclaim is looking for memory to free.
> Some read-only memory can often be dropped immediately on the grounds
> that its data can be read back in if needed.  Other memory can only be
> dropped after being written out, which involves a delay.  There are of
> course many other complications, but this will do for a start.
> 
> So, where does RCU fit in?
> 
> RCU fits in between the two.  With memory awaiting RCU, there is no need
> to write anything out, but there is a delay.  As such, memory waiting
> for an RCU grace period is similar to memory that is to be reclaimed
> after its I/O completes.
> 
> One complication, and a complication that we are considering exploiting,
> is that, unlike reclaimable memory waiting for I/O, we could often
> (but not always) have some control over how quickly RCU's grace periods
> complete.  And we already do this programmatically by using the choice
> between sychronize_rcu() and synchronize_rcu_expedited().  The question
> is whether we should expedite normal RCU grace periods during reclaim,
> and if so, under what conditions.
> 
> You identified one potential condition, namely the amount of memory
> waiting to be reclaimed.  One complication with this approach is that RCU
> has no idea how much memory each callback represents, and for call_rcu(),
> there is no way for it to find out.  For kfree_rcu(), there are ways,
> but as you know, I am questioning whether those ways are reasonable from
> a performance perspective.  But even if they are, we would be accepting
> more error from the memory waiting via call_rcu() than we would be
> accepting if we just counted blocks instead of bytes for kfree_rcu().

You're _way_ overcomplicating this.

The relevant thing to consider is the relative cost of __ksize() and
kfree_rcu(). __ksize() is already pretty cheap, and with the SLAB allocator
gone and space available in struct slab we can get it down to a single load.
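
For what it's worth, a minimal sketch of that byte accounting is below; the
per-CPU counter and the hooks are hypothetical, only slab objects are covered
(there is no vmalloc size helper yet), and __ksize() is currently mm-internal,
which is part of the debate here.

#include <linux/percpu.h>
#include <linux/slab.h>

/* Illustrative per-CPU counter of bytes stranded behind RCU grace periods. */
static DEFINE_PER_CPU(unsigned long, rcu_stranded_bytes);

/* Hypothetical hook on the kfree_rcu() enqueue path. */
static inline void rcu_account_stranded(const void *obj)
{
        this_cpu_add(rcu_stranded_bytes, __ksize(obj));
}

/* Matching decrement once the grace period has elapsed and the object is freed. */
static inline void rcu_account_unstranded(const void *obj)
{
        this_cpu_sub(rcu_stranded_bytes, __ksize(obj));
}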

> Let me reiterate that:  The estimation error that you are objecting to
> for kfree_rcu() is completely and utterly unavoidable for call_rcu().

Hardly; callsites that manually free memory after an RCU grace
period can do the accounting themselves - if they're hot enough to matter,
and most aren't.

And with memory allocation profiling coming, which also tracks the number of
allocations, we'll have an easy way to spot those.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27 15:39                             ` Matthew Wilcox
@ 2024-02-27 15:54                               ` Kent Overstreet
  0 siblings, 0 replies; 87+ messages in thread
From: Kent Overstreet @ 2024-02-27 15:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Al Viro, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Tue, Feb 27, 2024 at 03:39:35PM +0000, Matthew Wilcox wrote:
> On Tue, Feb 27, 2024 at 02:21:59AM -0500, Kent Overstreet wrote:
> > On Mon, Feb 26, 2024 at 03:48:35PM -0800, Linus Torvalds wrote:
> > > On Mon, 26 Feb 2024 at 14:46, Linus Torvalds
> > > <torvalds@linux-foundation.org> wrote:
> > > >
> > > > I really haven't tested this AT ALL. I'm much too scared.
> > > 
> > > "Courage is not the absence of fear, but acting in spite of it"
> > >          - Paddington Bear / Michal Scott
> > > 
> > > It seems to actually boot here.
> > > 
> > > That said, from a quick test with lots of threads all hammering on the
> > > same page - I'm still not entirely convinced it makes a difference.
> > > Sure, the kernel profile changes, but filemap_get_read_batch() wasn't
> > > very high up in the profile to begin with.
> > > 
> > > I didn't do any actual performance testing, I just did a 64-byte pread
> > > at offset 0 in a loop in 64 threads on my 32c/64t machine.
> > 
> > Only rough testing, but  this is looking like around a 25% performance
> > increase doing 4k random reads on a 1G file with fio, 8 jobs, on my
> > Ryzen 5950x - 16.7M -> 21.4M iops, very roughly. fio's a pig and we're
> > only spending half our cpu time in the kernel, so the buffered read path
> > is actually getting 40% or 50% faster.
> 
> Linus' patch only kicks in for 128 bytes or smaller.  So what are you
> measuring?

64 byte reads

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27 15:32                                             ` Paul E. McKenney
  2024-02-27 15:52                                               ` Kent Overstreet
@ 2024-02-27 15:54                                               ` Matthew Wilcox
  2024-02-27 16:21                                                 ` Paul E. McKenney
  1 sibling, 1 reply; 87+ messages in thread
From: Matthew Wilcox @ 2024-02-27 15:54 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Kent Overstreet, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Tue, Feb 27, 2024 at 07:32:32AM -0800, Paul E. McKenney wrote:
> At a ridiculously high level, reclaim is looking for memory to free.
> Some read-only memory can often be dropped immediately on the grounds
> that its data can be read back in if needed.  Other memory can only be
> dropped after being written out, which involves a delay.  There are of
> course many other complications, but this will do for a start.

Hi Paul,

I appreciate the necessity of describing what's going on at a very high
level, but there's a wrinkle that I'm not sure you're aware of which
may substantially change your argument.

For anonymous memory, we do indeed wait until reclaim to start writing it
to swap.  That may or may not be the right approach given how anonymous
memory is used (and could be the topic of an interesting discussion
at LSFMM).

For file-backed memory, we do not write back memory in reclaim.  If it
has got to the point of calling ->writepage in vmscan, things have gone
horribly wrong to the point where calling ->writepage will make things
worse.  This is why we're currently removing ->writepage from every
filesystem (only ->writepages will remain).  Instead, the page cache
is written back much earlier, once we get to balance_dirty_pages().
That lets us write pages in filesystem-friendly ways instead of in MM
LRU order.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27 15:52                                               ` Kent Overstreet
@ 2024-02-27 16:06                                                 ` Paul E. McKenney
  0 siblings, 0 replies; 87+ messages in thread
From: Paul E. McKenney @ 2024-02-27 16:06 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Tue, Feb 27, 2024 at 10:52:51AM -0500, Kent Overstreet wrote:
> On Tue, Feb 27, 2024 at 07:32:32AM -0800, Paul E. McKenney wrote:
> > I could simply use the same general approach that I use within RCU
> > itself, which currently has absolutely no idea how much memory (if any)
> > that each callback will free.  Especially given that some callbacks
> > free groups of memory blocks, while other free nothing.  ;-)
> > 
> > Alternatively, we could gather statistics on the amount of memory freed
> > by each callback and use that as an estimate.
> > 
> > But we should instead step back and ask exactly what we are trying to
> > accomplish here, which just might be what Dave Chinner was getting at.
> > 
> > At a ridiculously high level, reclaim is looking for memory to free.
> > Some read-only memory can often be dropped immediately on the grounds
> > that its data can be read back in if needed.  Other memory can only be
> > dropped after being written out, which involves a delay.  There are of
> > course many other complications, but this will do for a start.
> > 
> > So, where does RCU fit in?
> > 
> > RCU fits in between the two.  With memory awaiting RCU, there is no need
> > to write anything out, but there is a delay.  As such, memory waiting
> > for an RCU grace period is similar to memory that is to be reclaimed
> > after its I/O completes.
> > 
> > One complication, and a complication that we are considering exploiting,
> > is that, unlike reclaimable memory waiting for I/O, we could often
> > (but not always) have some control over how quickly RCU's grace periods
> > complete.  And we already do this programmatically by using the choice
> > between sychronize_rcu() and synchronize_rcu_expedited().  The question
> > is whether we should expedite normal RCU grace periods during reclaim,
> > and if so, under what conditions.
> > 
> > You identified one potential condition, namely the amount of memory
> > waiting to be reclaimed.  One complication with this approach is that RCU
> > has no idea how much memory each callback represents, and for call_rcu(),
> > there is no way for it to find out.  For kfree_rcu(), there are ways,
> > but as you know, I am questioning whether those ways are reasonable from
> > a performance perspective.  But even if they are, we would be accepting
> > more error from the memory waiting via call_rcu() than we would be
> > accepting if we just counted blocks instead of bytes for kfree_rcu().
> 
> You're _way_ overcomplicating this.

Sorry, but no.

Please read the remainder of my prior email carefully.

							Thanx, Paul

> The relevant thing to consider is the relative cost of __ksize() and
> kfree_rcu(). __ksize() is already pretty cheap, and with slab gone and
> space available in struct slab we can get it down to a single load.
> 
> > Let me reiterate that:  The estimation error that you are objecting to
> > for kfree_rcu() is completely and utterly unavoidable for call_rcu().
> 
> hardly, callsites manually freeing memory manually after an RCU grace
> period can do the accounting manually - if they're hot enough to matter,
> most aren.t
> 
> and with memory allocation profiling coming, which also tracks # of
> allocations, we'll also have an easy way to spot those.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27 15:54                                               ` Matthew Wilcox
@ 2024-02-27 16:21                                                 ` Paul E. McKenney
  2024-02-27 16:34                                                   ` Kent Overstreet
  0 siblings, 1 reply; 87+ messages in thread
From: Paul E. McKenney @ 2024-02-27 16:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kent Overstreet, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Tue, Feb 27, 2024 at 03:54:23PM +0000, Matthew Wilcox wrote:
> On Tue, Feb 27, 2024 at 07:32:32AM -0800, Paul E. McKenney wrote:
> > At a ridiculously high level, reclaim is looking for memory to free.
> > Some read-only memory can often be dropped immediately on the grounds
> > that its data can be read back in if needed.  Other memory can only be
> > dropped after being written out, which involves a delay.  There are of
> > course many other complications, but this will do for a start.
> 
> Hi Paul,
> 
> I appreciate the necessity of describing what's going on at a very high
> level, but there's a wrinkle that I'm not sure you're aware of which
> may substantially change your argument.
> 
> For anonymous memory, we do indeed wait until reclaim to start writing it
> to swap.  That may or may not be the right approach given how anonymous
> memory is used (and could be the topic of an interesting discussion
> at LSFMM).
> 
> For file-backed memory, we do not write back memory in reclaim.  If it
> has got to the point of calling ->writepage in vmscan, things have gone
> horribly wrong to the point where calling ->writepage will make things
> worse.  This is why we're currently removing ->writepage from every
> filesystem (only ->writepages will remain).  Instead, the page cache
> is written back much earlier, once we get to balance_dirty_pages().
> That lets us write pages in filesystem-friendly ways instead of in MM
> LRU order.

Thank you for the additional details.

But please allow me to further summarize the point of my prior email
that seems to be getting lost:

1.	RCU already does significant work prodding grace periods.

2.	There is no reasonable way to provide estimates of the
	memory sent to RCU via call_rcu(), and in many cases
	the bulk of the waiting memory will be call_rcu() memory.

Therefore, if we cannot come up with a heuristic that does not need to
know the bytes of memory waiting, we are stuck anyway.

So perhaps the proper heuristic for RCU speeding things up is simply
"Hey RCU, we are in reclaim!".

Or do you have hard data showing that this is insufficient?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27 16:21                                                 ` Paul E. McKenney
@ 2024-02-27 16:34                                                   ` Kent Overstreet
  2024-02-27 17:58                                                     ` Paul E. McKenney
  0 siblings, 1 reply; 87+ messages in thread
From: Kent Overstreet @ 2024-02-27 16:34 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Tue, Feb 27, 2024 at 08:21:29AM -0800, Paul E. McKenney wrote:
> On Tue, Feb 27, 2024 at 03:54:23PM +0000, Matthew Wilcox wrote:
> > On Tue, Feb 27, 2024 at 07:32:32AM -0800, Paul E. McKenney wrote:
> > > At a ridiculously high level, reclaim is looking for memory to free.
> > > Some read-only memory can often be dropped immediately on the grounds
> > > that its data can be read back in if needed.  Other memory can only be
> > > dropped after being written out, which involves a delay.  There are of
> > > course many other complications, but this will do for a start.
> > 
> > Hi Paul,
> > 
> > I appreciate the necessity of describing what's going on at a very high
> > level, but there's a wrinkle that I'm not sure you're aware of which
> > may substantially change your argument.
> > 
> > For anonymous memory, we do indeed wait until reclaim to start writing it
> > to swap.  That may or may not be the right approach given how anonymous
> > memory is used (and could be the topic of an interesting discussion
> > at LSFMM).
> > 
> > For file-backed memory, we do not write back memory in reclaim.  If it
> > has got to the point of calling ->writepage in vmscan, things have gone
> > horribly wrong to the point where calling ->writepage will make things
> > worse.  This is why we're currently removing ->writepage from every
> > filesystem (only ->writepages will remain).  Instead, the page cache
> > is written back much earlier, once we get to balance_dirty_pages().
> > That lets us write pages in filesystem-friendly ways instead of in MM
> > LRU order.
> 
> Thank you for the additional details.
> 
> But please allow me to further summarize the point of my prior email
> that seems to be getting lost:
> 
> 1.	RCU already does significant work prodding grace periods.
> 
> 2.	There is no reasonable way to provide estimates of the
> 	memory sent to RCU via call_rcu(), and in many cases
> 	the bulk of the waiting memory will be call_rcu() memory.
> 
> Therefore, if we cannot come up with a heuristic that does not need to
> know the bytes of memory waiting, we are stuck anyway.

That is a completely asinine argument.

> So perhaps the proper heuristic for RCU speeding things up is simply
> "Hey RCU, we are in reclaim!".

Because that's the wrong heuristic. There are important workloads for
which  we're _always_ in reclaim, but as long as RCU grace periods are
happening at some steady rate, the amount of memory stranded will be
bounded and there's no reason to expedite grace periods.

If we start RCU freeing all pagecache folios we're going to be cycling
memory through RCU freeing at the rate of gigabytes per second, tens of
gigabytes per second on high end systems.

Do you put hard limits on how long we can go before an RCU grace period
that will limit the amount of memory stranded to something acceptable?
Yes or no?

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27  7:21                           ` Kent Overstreet
  2024-02-27 15:39                             ` Matthew Wilcox
@ 2024-02-27 16:34                             ` Linus Torvalds
  2024-02-27 16:47                               ` Kent Overstreet
  1 sibling, 1 reply; 87+ messages in thread
From: Linus Torvalds @ 2024-02-27 16:34 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Al Viro, Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Mon, 26 Feb 2024 at 23:22, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> Only rough testing, but  this is looking like around a 25% performance
> increase doing 4k random reads on a 1G file with fio, 8 jobs, on my
> Ryzen 5950x - 16.7M -> 21.4M iops, very roughly. fio's a pig and we're
> only spending half our cpu time in the kernel, so the buffered read path
> is actually getting 40% or 50% faster.
>
> So I'd say that's substantial.

No,  you're doing something wrong. The new fastread logic only
triggers for reads <= 128 bytes, so you must have done some other
major change (like built a kernel without the mitigations, and
compared it to one with mitigations - that would easily be 25%
depending on hardware).

                  Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27 16:34                             ` Linus Torvalds
@ 2024-02-27 16:47                               ` Kent Overstreet
  2024-02-27 17:07                                 ` Linus Torvalds
  0 siblings, 1 reply; 87+ messages in thread
From: Kent Overstreet @ 2024-02-27 16:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Tue, Feb 27, 2024 at 08:34:39AM -0800, Linus Torvalds wrote:
> On Mon, 26 Feb 2024 at 23:22, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > Only rough testing, but  this is looking like around a 25% performance
> > increase doing 4k random reads on a 1G file with fio, 8 jobs, on my
> > Ryzen 5950x - 16.7M -> 21.4M iops, very roughly. fio's a pig and we're
> > only spending half our cpu time in the kernel, so the buffered read path
> > is actually getting 40% or 50% faster.
> >
> > So I'd say that's substantial.
> 
> No,  you're doing something wrong. The new fastread logic only
> triggers for reads <= 128 bytes, so you must have done some other
> major change (like built a kernel without the mitigations, and
> compared it to one with mitigations - that would easily be 25%
> depending on hardware).

Like I replied to willy - the "4k" was a typo from force of habit, I was
doing 64 byte random reads.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27 16:47                               ` Kent Overstreet
@ 2024-02-27 17:07                                 ` Linus Torvalds
  2024-02-27 17:20                                   ` Kent Overstreet
  0 siblings, 1 reply; 87+ messages in thread
From: Linus Torvalds @ 2024-02-27 17:07 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Al Viro, Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Tue, 27 Feb 2024 at 08:47, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> Like I replied to willy - the "4k" was a typo from force of habit, I was
> doing 64 byte random reads.

Ok, can you send the exact fio command? I'll do some testing here too.

                  Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27 17:07                                 ` Linus Torvalds
@ 2024-02-27 17:20                                   ` Kent Overstreet
  2024-02-27 18:02                                     ` Linus Torvalds
  0 siblings, 1 reply; 87+ messages in thread
From: Kent Overstreet @ 2024-02-27 17:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Tue, Feb 27, 2024 at 09:07:01AM -0800, Linus Torvalds wrote:
> On Tue, 27 Feb 2024 at 08:47, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > Like I replied to willy - the "4k" was a typo from force of habit, I was
> > doing 64 byte random reads.
> 
> Ok, can you send the exact fio command? I'll do some testing here too.
> 
>                   Linus

fio					\
	--group_reporting		\
	--gtod_reduce=1			\
	--norandommap			\
	--runtime=60s			\
	--exitall_on_error=1		\
	--filename=/root/foo		\
	--name=read			\
	--rw=randread			\
	--loops=100			\
	--numjobs=8			\
	--filesize=1G			\
	--bs=64b			\
	--io_size=10G

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27 16:34                                                   ` Kent Overstreet
@ 2024-02-27 17:58                                                     ` Paul E. McKenney
  2024-02-28 23:55                                                       ` Kent Overstreet
  0 siblings, 1 reply; 87+ messages in thread
From: Paul E. McKenney @ 2024-02-27 17:58 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Tue, Feb 27, 2024 at 11:34:06AM -0500, Kent Overstreet wrote:
> On Tue, Feb 27, 2024 at 08:21:29AM -0800, Paul E. McKenney wrote:
> > On Tue, Feb 27, 2024 at 03:54:23PM +0000, Matthew Wilcox wrote:
> > > On Tue, Feb 27, 2024 at 07:32:32AM -0800, Paul E. McKenney wrote:
> > > > At a ridiculously high level, reclaim is looking for memory to free.
> > > > Some read-only memory can often be dropped immediately on the grounds
> > > > that its data can be read back in if needed.  Other memory can only be
> > > > dropped after being written out, which involves a delay.  There are of
> > > > course many other complications, but this will do for a start.
> > > 
> > > Hi Paul,
> > > 
> > > I appreciate the necessity of describing what's going on at a very high
> > > level, but there's a wrinkle that I'm not sure you're aware of which
> > > may substantially change your argument.
> > > 
> > > For anonymous memory, we do indeed wait until reclaim to start writing it
> > > to swap.  That may or may not be the right approach given how anonymous
> > > memory is used (and could be the topic of an interesting discussion
> > > at LSFMM).
> > > 
> > > For file-backed memory, we do not write back memory in reclaim.  If it
> > > has got to the point of calling ->writepage in vmscan, things have gone
> > > horribly wrong to the point where calling ->writepage will make things
> > > worse.  This is why we're currently removing ->writepage from every
> > > filesystem (only ->writepages will remain).  Instead, the page cache
> > > is written back much earlier, once we get to balance_dirty_pages().
> > > That lets us write pages in filesystem-friendly ways instead of in MM
> > > LRU order.
> > 
> > Thank you for the additional details.
> > 
> > But please allow me to further summarize the point of my prior email
> > that seems to be getting lost:
> > 
> > 1.	RCU already does significant work prodding grace periods.
> > 
> > 2.	There is no reasonable way to provide estimates of the
> > 	memory sent to RCU via call_rcu(), and in many cases
> > 	the bulk of the waiting memory will be call_rcu() memory.
> > 
> > Therefore, if we cannot come up with a heuristic that does not need to
> > know the bytes of memory waiting, we are stuck anyway.
> 
> That is a completely asinine argument.

Huh.  Anything else you need to get off your chest?

On the off-chance it is unclear, I do disagree with your assessment.

> > So perhaps the proper heuristic for RCU speeding things up is simply
> > "Hey RCU, we are in reclaim!".
> 
> Because that's the wrong heuristic. There are important workloads for
> which  we're _always_ in reclaim, but as long as RCU grace periods are
> happening at some steady rate, the amount of memory stranded will be
> bounded and there's no reason to expedite grace periods.

RCU is in fact designed to handle heavy load, and contains a number of
mechanisms to operate more efficiently at higher load than at lower load.
It also contains mechanisms to expedite grace periods under heavy load.
Some of which I already described in earlier emails on this thread.

> If we start RCU freeing all pagecache folios we're going to be cycling
> memory through RCU freeing at the rate of gigabytes per second, tens of
> gigabytes per second on high end systems.

The load on RCU would be measured in terms of requests (kfree_rcu() and
friends) per unit time per CPU, not in terms of gigabytes per unit time.
Of course, the amount of memory per unit time might be an issue for
whatever allocators you are using, and as Matthew has often pointed out,
the deferred reclamation incurs additional cache misses.

And rcutorture really does do forward-progress testing on vanilla RCU
that features tight in-kernel loops doing call_rcu() without bounds on
memory in flight, and RCU routinely survives this in a VM that is given
only 512MB of memory.  In fact, any failure to survive this is considered
a bug to be fixed.

So I suspect RCU is quite capable of handling this load.  But how many
additional kfree_rcu() calls are you anticipating per unit time per CPU?
For example, could we simply measure the rate at which pagecache folios
are currently being freed under heavy load?  Given that information,
we could just try it out.

> Do you put hard limits on how long we can go before an RCU grace period
> that will limit the amount of memory stranded to something acceptable?
> Yes or no?

C'mon, Kent, you can do better than this.

Given what you have described, I believe that RCU would have no problem
with it.  However, additional information would be welcome.  As always,
I could well be missing something.

And yes, kernel developers can break RCU (with infinite loops in RCU
readers being a popular approach) and systems administrators can break
RCU, for example, by confining processing of offloading callbacks to a
single CPU on a 128-CPU system.  Don't do that.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27 17:20                                   ` Kent Overstreet
@ 2024-02-27 18:02                                     ` Linus Torvalds
  0 siblings, 0 replies; 87+ messages in thread
From: Linus Torvalds @ 2024-02-27 18:02 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Al Viro, Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Tue, 27 Feb 2024 at 09:20, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> fio                                     \
>         --group_reporting               \
[..]

Ok, I have to say that my kernel profile looks very nice with that
patch. It all looks almost perfect, and the main big spike in the
profile is the inevitable "clac/stac" pair around the user copy
because my old threadripper just ends up not doing that very well.

Of course, I disable all the mitigations for my profiling runs,
because not doing that is just too depressing and any real kernel
issues that we can do something about get hidden by the noise
introduced by the mitigations. So my profiles are entirely artificial
in that sense (I tell myself that it's what future CPU's will look
like and that it's ok. Then I cry myself to sleep).

I'm not going to push that patch of mine - I still suspect that doing
this all under RCU with page faults optimistically disabled would be a
much nicer model and work for bigger areas - but I don't hate the
patch either, and if somebody else wants to push it, I'm ok with it.

I can't actually think of any real load that does a lot of small
reads. Willy's odd customer notwithstanding.

                      Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27 10:07 ` Kent Overstreet
  2024-02-27 14:08   ` Luis Chamberlain
@ 2024-02-27 22:13   ` Dave Chinner
  2024-02-27 22:21     ` Kent Overstreet
  1 sibling, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2024-02-27 22:13 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez,
	Pankaj Raghav, Jens Axboe, Christoph Hellwig, Chris Mason,
	Johannes Weiner, Matthew Wilcox, Linus Torvalds

On Tue, Feb 27, 2024 at 05:07:30AM -0500, Kent Overstreet wrote:
> On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> > Part of the testing we have done with LBS was to do some performance
> > tests on XFS to ensure things are not regressing. Building linux is a
> > fine decent test and we did some random cloud instance tests on that and
> > presented that at Plumbers, but it doesn't really cut it if we want to
> > push things to the limit though. What are the limits to buffered IO
> > and how do we test that? Who keeps track of it?
> > 
> > The obvious recurring tension is that for really high performance folks
> > just recommend to use birect IO. But if you are stress testing changes
> > to a filesystem and want to push buffered IO to its limits it makes
> > sense to stick to buffered IO, otherwise how else do we test it?
> > 
> > It is good to know limits to buffered IO too because some workloads
> > cannot use direct IO.  For instance PostgreSQL doesn't have direct IO
> > support and even as late as the end of last year we learned that adding
> > direct IO to PostgreSQL would be difficult.  Chris Mason has noted also
> > that direct IO can also force writes during reads (?)... Anyway, testing
> > the limits of buffered IO limits to ensure you are not creating
> > regressions when doing some page cache surgery seems like it might be
> > useful and a sensible thing to do .... The good news is we have not found
> > regressions with LBS but all the testing seems to beg the question, of what
> > are the limits of buffered IO anyway, and how does it scale? Do we know, do
> > we care? Do we keep track of it? How does it compare to direct IO for some
> > workloads? How big is the delta? How do we best test that? How do we
> > automate all that? Do we want to automatically test this to avoid regressions?
> > 
> > The obvious issues with some workloads for buffered IO is having a
> > possible penality if you are not really re-using folios added to the
> > page cache. Jens Axboe reported a while ago issues with workloads with
> > random reads over a data set 10x the size of RAM and also proposed
> > RWF_UNCACHED as a way to help [0]. As Chinner put it, this seemed more
> > like direct IO with kernel pages and a memcpy(), and it requires
> > further serialization to be implemented that we already do for
> > direct IO for writes. There at least seems to be agreement that if we're
> > going to provide an enhancement or alternative that we should strive to not
> > make the same mistakes we've done with direct IO. The rationale for some
> > workloads to use buffered IO is it helps reduce some tail latencies, so
> > that's something to live up to.
> > 
> > On that same thread Christoph also mentioned the possibility of a direct
> > IO variant which can leverage the cache. Is that something we want to
> > move forward with?
> > 
> > Chris Mason also listed a few other desirables if we do:
> > 
> > - Allowing concurrent writes (xfs DIO does this now)
> 
> AFAIK every filesystem allows concurrent direct writes, not just xfs,
> it's _buffered_ writes that we care about here.

We could do concurrent buffered writes in XFS - we would just use
the same locking strategy as direct IO and fall back on folio locks
for copy-in exclusion like ext4 does.

The real question is how much of userspace will that break, because
of implicit assumptions that the kernel has always serialised
buffered writes?

> I just pushed a patch to my CI for buffered writes without taking the
> inode lock - for bcachefs. It'll be straightforward, but a decent amount
> of work, to lift this to the VFS, if people are interested in
> collaborating.

Yeah, XFS would just revert to shared inode locking - we still need
the inode lock for things like truncate/fallocate exclusion.

> https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-buffered-write-locking
> 
> The approach is: for non extending, non appending writes, see if we can
> pin the entire range of the pagecache we're writing to; fall back to
> taking the inode lock if we can't.

XFS just falls back to exclusive locking if the file needs
extending.
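
For illustration, a rough sketch of that policy is below; the helper is made
up, it assumes XFS-internal definitions (XFS_IOLOCK_SHARED/XFS_IOLOCK_EXCL),
and it skips the permission and RMW checks a real version would need.

#include <linux/fs.h>

/*
 * Illustrative only: take the iolock shared for buffered writes that do
 * not extend the file (folio locks then serialise the actual copy-in),
 * and fall back to the exclusive iolock for extending or appending
 * writes, mirroring what the XFS direct IO write path already does.
 */
static unsigned int xfs_buffered_write_iolock_mode(struct kiocb *iocb,
                                                   struct inode *inode,
                                                   size_t count)
{
        loff_t end = iocb->ki_pos + count;

        if (!(iocb->ki_flags & IOCB_APPEND) && end <= i_size_read(inode))
                return XFS_IOLOCK_SHARED;       /* non-extending: concurrent */

        return XFS_IOLOCK_EXCL;                 /* extending: exclusive, as today */
}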

> If we do a short write because of a page fault (despite previously
> faulting in the userspace buffer), there is no way to completely prevent
> torn writes an atomicity breakage; we could at least try a trylock on
> the inode lock, I didn't do that here.

As soon as we go for concurrent writes, we give up on any concept of
atomicity of buffered writes (esp. w.r.t reads), so this really
doesn't matter at all.

> For lifting this to the VFS, this needs
>  - My darray code, which I'll be moving to include/linux/ in the 6.9
>    merge window
>  - My pagecache add lock - we need this for sychronization with hole
>    punching and truncate when we don't have the inode lock.
>  - My vectorized buffered write path lifted to filemap.c, which means we
>    need some sort of vectorized replacement for .write_begin and
>    .write_end

I don't think we need any of that - I think you're overcomplicating
it. As long as a filesystem has a mechanism that works for
concurrent DIO writes, it can just reuse that for concurrent
buffered writes....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27 22:13   ` Dave Chinner
@ 2024-02-27 22:21     ` Kent Overstreet
  2024-02-27 22:42       ` Dave Chinner
  2024-02-27 22:46       ` Linus Torvalds
  0 siblings, 2 replies; 87+ messages in thread
From: Kent Overstreet @ 2024-02-27 22:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez,
	Pankaj Raghav, Jens Axboe, Christoph Hellwig, Chris Mason,
	Johannes Weiner, Matthew Wilcox, Linus Torvalds

On Wed, Feb 28, 2024 at 09:13:05AM +1100, Dave Chinner wrote:
> On Tue, Feb 27, 2024 at 05:07:30AM -0500, Kent Overstreet wrote:
> > AFAIK every filesystem allows concurrent direct writes, not just xfs,
> > it's _buffered_ writes that we care about here.
> 
> We could do concurrent buffered writes in XFS - we would just use
> the same locking strategy as direct IO and fall back on folio locks
> for copy-in exclusion like ext4 does.

ext4 code doesn't do that. It takes the inode lock in exclusive mode,
just like everyone else.

> The real question is how much of userspace will that break, because
> of implicit assumptions that the kernel has always serialised
> buffered writes?

What would break?

> > If we do a short write because of a page fault (despite previously
> > faulting in the userspace buffer), there is no way to completely prevent
> > torn writes an atomicity breakage; we could at least try a trylock on
> > the inode lock, I didn't do that here.
> 
> As soon as we go for concurrent writes, we give up on any concept of
> atomicity of buffered writes (esp. w.r.t reads), so this really
> doesn't matter at all.

We've already given up buffered write vs. read atomicity, and have for a
long time - the buffered read path takes no locks.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27 22:21     ` Kent Overstreet
@ 2024-02-27 22:42       ` Dave Chinner
  2024-02-28  7:48         ` [Lsf-pc] " Amir Goldstein
  2024-02-27 22:46       ` Linus Torvalds
  1 sibling, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2024-02-27 22:42 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez,
	Pankaj Raghav, Jens Axboe, Christoph Hellwig, Chris Mason,
	Johannes Weiner, Matthew Wilcox, Linus Torvalds

On Tue, Feb 27, 2024 at 05:21:20PM -0500, Kent Overstreet wrote:
> On Wed, Feb 28, 2024 at 09:13:05AM +1100, Dave Chinner wrote:
> > On Tue, Feb 27, 2024 at 05:07:30AM -0500, Kent Overstreet wrote:
> > > AFAIK every filesystem allows concurrent direct writes, not just xfs,
> > > it's _buffered_ writes that we care about here.
> > 
> > We could do concurrent buffered writes in XFS - we would just use
> > the same locking strategy as direct IO and fall back on folio locks
> > for copy-in exclusion like ext4 does.
> 
> ext4 code doesn't do that. it takes the inode lock in exclusive mode,
> just like everyone else.

Uhuh. ext4 does allow concurrent DIO writes. It's just much more
constrained than XFS. See ext4_dio_write_checks().

> > The real question is how much of userspace will that break, because
> > of implicit assumptions that the kernel has always serialised
> > buffered writes?
> 
> What would break?

Good question. If you don't know the answer, then you've got the
same problem I have: we don't know whether concurrent
applications that use buffered IO extensively (e.g. postgres) assume
data coherency because of the implicit serialisation occurring
during buffered IO writes.

> > > If we do a short write because of a page fault (despite previously
> > > faulting in the userspace buffer), there is no way to completely prevent
> > > torn writes an atomicity breakage; we could at least try a trylock on
> > > the inode lock, I didn't do that here.
> > 
> > As soon as we go for concurrent writes, we give up on any concept of
> > atomicity of buffered writes (esp. w.r.t reads), so this really
> > doesn't matter at all.
> 
> We've already given up buffered write vs. read atomicity, have for a
> long time - buffered read path takes no locks.

We still have explicit buffered read() vs buffered write() atomicity
in XFS via buffered reads taking the inode lock shared (see
xfs_file_buffered_read()) because that's what POSIX says we should
have.

Essentially, we need to explicitly give POSIX the big finger and
state that there are no atomicity guarantees given for write() calls
of any size, nor are there any guarantees for data coherency for
any overlapping concurrent buffered IO operations.

Those are things we haven't completely given up yet w.r.t. buffered
IO, and enabling concurrent buffered writes will expose them to users.
So we need to have explicit policies for this and document them
clearly in all the places that application developers might look
for behavioural hints.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27 22:21     ` Kent Overstreet
  2024-02-27 22:42       ` Dave Chinner
@ 2024-02-27 22:46       ` Linus Torvalds
  2024-02-27 23:00         ` Linus Torvalds
                           ` (2 more replies)
  1 sibling, 3 replies; 87+ messages in thread
From: Linus Torvalds @ 2024-02-27 22:46 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Dave Chinner, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm,
	Daniel Gomez, Pankaj Raghav, Jens Axboe, Christoph Hellwig,
	Chris Mason, Johannes Weiner, Matthew Wilcox

On Tue, 27 Feb 2024 at 14:21, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> ext4 code doesn't do that. it takes the inode lock in exclusive mode,
> just like everyone else.

Not for dio, it doesn't.

> > The real question is how much of userspace will that break, because
> > of implicit assumptions that the kernel has always serialised
> > buffered writes?
>
> What would break?

Well, at least in theory you could have concurrent overlapping writes
of folio-crossing records, and currently you do get the guarantee that
one or the other record is written, but relying just on page locking
would mean that you might get a mix of them at page boundaries.

I'm not sure that such a model would make any sense, but if you
*intend* to break if somebody doesn't do write-to-write exclusion,
that's certainly possible.

The fact that we haven't given the atomicity guarantees wrt reads does
imply that nobody can do this kind of crazy thing, but it's an
implication, not a guarantee.

I really don't think such an odd load is sensible (except for the
special case of O_APPEND records, which definitely is sensible), and
it is certainly solvable.

For example, a purely "local lock" model would be to just lock all
pages in order as you write them, and not unlock the previous page
until you've locked the next one.

That is a really simple model that doesn't require any range locking
or anything like that because it simply relies on all writes always
being done strictly in file position order.
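
A rough sketch of that position-ordered model is below; the helper name is
made up, and error handling, folio_put(), and the actual copy-in are elided.

#include <linux/pagemap.h>

/*
 * Illustrative hand-over-hand folio locking for a write spanning
 * [first, last]: folios are locked in ascending index order and the
 * previous folio is only unlocked once the next one is held, so two
 * concurrent overlapping writers cannot leave an interleaved (ABAB)
 * final state at folio boundaries.
 */
static void write_range_hand_over_hand(struct address_space *mapping,
                                       pgoff_t first, pgoff_t last)
{
        struct folio *prev = NULL;
        pgoff_t index;

        for (index = first; index <= last; index++) {
                struct folio *folio = __filemap_get_folio(mapping, index,
                                                FGP_LOCK | FGP_CREAT,
                                                mapping_gfp_mask(mapping));
                if (IS_ERR(folio))
                        break;
                if (prev)
                        folio_unlock(prev);     /* hand over: next is locked */
                /* ... copy this folio's part of the user buffer here ... */
                prev = folio;
        }
        if (prev)
                folio_unlock(prev);
}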

But you'd have to be very careful with deadlocks anyway in case there
are other cases of multi-page locks. And even without deadlocks, you
might end up having just a lot more lock contention (nested locks can
have *much* worse contention than sequential ones)

There are other models with multi-level locking, but I think we'd like
to try to keep things simple if we change something core like this.

               Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27 22:46       ` Linus Torvalds
@ 2024-02-27 23:00         ` Linus Torvalds
  2024-02-28  2:22         ` Kent Overstreet
  2024-02-28 18:18         ` Kent Overstreet
  2 siblings, 0 replies; 87+ messages in thread
From: Linus Torvalds @ 2024-02-27 23:00 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Dave Chinner, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm,
	Daniel Gomez, Pankaj Raghav, Jens Axboe, Christoph Hellwig,
	Chris Mason, Johannes Weiner, Matthew Wilcox

On Tue, 27 Feb 2024 at 14:46, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> For example, a purely "local lock" model would be to just lock all
> pages in order as you write them, and not unlock the previous page
> until you've locked the next one.

Thinking more about this, maybe that doesn't really guarantee much.
The final state would be that even with concurrent overlapping writes,
one writer did its overlapping write fully and you'd never have mixed
ABAB kind of results, but you could still have concurrent readers see
the two writes progressing concurrently.

Of course, since readers aren't serialized as-is, I'm not sure if
"readers can see intermediate state" is anything new or relevant
anyway.

Maybe the worry isn't worth it.

                Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27 22:46       ` Linus Torvalds
  2024-02-27 23:00         ` Linus Torvalds
@ 2024-02-28  2:22         ` Kent Overstreet
  2024-02-28  3:00           ` Matthew Wilcox
  2024-02-28 18:18         ` Kent Overstreet
  2 siblings, 1 reply; 87+ messages in thread
From: Kent Overstreet @ 2024-02-28  2:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm,
	Daniel Gomez, Pankaj Raghav, Jens Axboe, Christoph Hellwig,
	Chris Mason, Johannes Weiner, Matthew Wilcox

On Tue, Feb 27, 2024 at 02:46:11PM -0800, Linus Torvalds wrote:
> On Tue, 27 Feb 2024 at 14:21, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > ext4 code doesn't do that. it takes the inode lock in exclusive mode,
> > just like everyone else.
> 
> Not for dio, it doesn't.
> 
> > > The real question is how much of userspace will that break, because
> > > of implicit assumptions that the kernel has always serialised
> > > buffered writes?
> >
> > What would break?
> 
> Well, at least in theory you could have concurrent overlapping writes
> of folio crossing records, and currently you do get the guarantee that
> one or the other record is written, but relying just on page locking
> would mean that you might get a mix of them at page boundaries.
> 
> I'm not sure that such a model would make any sense, but if you
> *intend* to break if somebody doesn't do write-to-write exclusion,
> that's certainly possible.
> 
> The fact that we haven't given the atomicity guarantees wrt reads does
> imply that nobody can do this kinds of crazy thing, but it's an
> implication, not a guarantee.
> 
> I really don't think such an odd load is sensible (except for the
> special case of O_APPEND records, which definitely is sensible), and
> it is certainly solvable.
> 
> For example, a purely "local lock" model would be to just lock all
> pages in order as you write them, and not unlock the previous page
> until you've locked the next one.

The code I'm testing locks _all_ the folios we're writing to
simultaneously, and if they can't all be pinned and locked just falls
back to the inode lock.
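
A stripped-down sketch of that strategy is below; the helper is made up for
illustration and error handling is reduced to the bare minimum.

#include <linux/pagemap.h>

/*
 * Illustrative version of "lock every folio in the write range up front,
 * in ascending index order, and fall back to the inode lock if any of
 * them cannot be taken without blocking".  The flags come from the
 * current pagecache API; the function itself is hypothetical.
 */
static bool trylock_write_range(struct address_space *mapping,
                                pgoff_t first, pgoff_t last,
                                struct folio **folios)
{
        unsigned int n = 0;
        pgoff_t index;

        for (index = first; index <= last; index++, n++) {
                struct folio *folio = __filemap_get_folio(mapping, index,
                                        FGP_LOCK | FGP_CREAT | FGP_NOWAIT,
                                        mapping_gfp_mask(mapping));
                if (IS_ERR_OR_NULL(folio))
                        goto unwind;
                folios[n] = folio;
        }
        return true;            /* whole range locked, no inode lock needed */

unwind:
        while (n--) {
                folio_unlock(folios[n]);
                folio_put(folios[n]);
        }
        return false;           /* caller takes the inode lock and retries */
}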

Which does raise the question of whether we've ever attempted to define a
lock ordering on folios. I suspect not, since the folio lock doesn't even
seem to have lockdep support.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-28  2:22         ` Kent Overstreet
@ 2024-02-28  3:00           ` Matthew Wilcox
  2024-02-28  4:22             ` Matthew Wilcox
  0 siblings, 1 reply; 87+ messages in thread
From: Matthew Wilcox @ 2024-02-28  3:00 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Linus Torvalds, Dave Chinner, Luis Chamberlain, lsf-pc,
	linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Tue, Feb 27, 2024 at 09:22:26PM -0500, Kent Overstreet wrote:
> Which does raise the question of if we've ever attempted to define a
> lock ordering on folios. I suspect not, since folio lock doesn't even
> seem to have lockdep support.

We even wrote it down!

                /*
                 * To avoid deadlocks between range_cyclic writeback and callers
                 * that hold pages in PageWriteback to aggregate I/O until
                 * the writeback iteration finishes, we do not loop back to the
                 * start of the file.  Doing so causes a page lock/page
                 * writeback access order inversion - we should only ever lock
                 * multiple pages in ascending page->index order, and looping
                 * back to the start of the file violates that rule and causes
                 * deadlocks.
                 */

(I'll take the AR to put this somewhere better like in the folio_lock()
kernel-doc)

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-28  3:00           ` Matthew Wilcox
@ 2024-02-28  4:22             ` Matthew Wilcox
  2024-02-28 17:34               ` Kent Overstreet
  0 siblings, 1 reply; 87+ messages in thread
From: Matthew Wilcox @ 2024-02-28  4:22 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Linus Torvalds, Dave Chinner, Luis Chamberlain, lsf-pc,
	linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Wed, Feb 28, 2024 at 03:00:36AM +0000, Matthew Wilcox wrote:
> On Tue, Feb 27, 2024 at 09:22:26PM -0500, Kent Overstreet wrote:
> > Which does raise the question of if we've ever attempted to define a
> > lock ordering on folios. I suspect not, since folio lock doesn't even
> > seem to have lockdep support.
> 
> We even wrote it down!
> 
>                 /*
>                  * To avoid deadlocks between range_cyclic writeback and callers
>                  * that hold pages in PageWriteback to aggregate I/O until
>                  * the writeback iteration finishes, we do not loop back to the
>                  * start of the file.  Doing so causes a page lock/page
>                  * writeback access order inversion - we should only ever lock
>                  * multiple pages in ascending page->index order, and looping
>                  * back to the start of the file violates that rule and causes
>                  * deadlocks.
>                  */
> 
> (I'll take the AR to put this somewhere better like in the folio_lock()
> kernel-doc)

Um.  I already did.

 * Context: May sleep.  If you need to acquire the locks of two or
 * more folios, they must be in order of ascending index, if they are
 * in the same address_space.  If they are in different address_spaces,
 * acquire the lock of the folio which belongs to the address_space which
 * has the lowest address in memory first.

Where should I have put this information that you would have found it,
if not in the kernel-doc for folio_lock()?
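
As a small illustration of that documented rule, a hypothetical helper (not
an existing API) for taking two folio locks safely might look like this:

#include <linux/minmax.h>
#include <linux/pagemap.h>

/*
 * Same address_space: lock in ascending ->index order; different
 * address_spaces: lock the folio whose mapping has the lower address
 * first, per the folio_lock() kernel-doc quoted above.
 */
static void lock_two_folios(struct folio *a, struct folio *b)
{
    if (a == b) {
        folio_lock(a);
        return;
    }
    if (a->mapping == b->mapping ? a->index > b->index
                                 : a->mapping > b->mapping)
        swap(a, b);

    folio_lock(a);
    folio_lock(b);
}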

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27 22:42       ` Dave Chinner
@ 2024-02-28  7:48         ` Amir Goldstein
  2024-02-28 14:01           ` Chris Mason
  2024-02-29  0:25           ` Dave Chinner
  0 siblings, 2 replies; 87+ messages in thread
From: Amir Goldstein @ 2024-02-28  7:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kent Overstreet, Pankaj Raghav, Jens Axboe, Chris Mason,
	Matthew Wilcox, Daniel Gomez, linux-mm, Luis Chamberlain,
	Johannes Weiner, linux-fsdevel, lsf-pc, Linus Torvalds,
	Christoph Hellwig, Josef Bacik, Jan Kara

On Wed, Feb 28, 2024 at 12:42 AM Dave Chinner via Lsf-pc
<lsf-pc@lists.linux-foundation.org> wrote:
>
> On Tue, Feb 27, 2024 at 05:21:20PM -0500, Kent Overstreet wrote:
> > On Wed, Feb 28, 2024 at 09:13:05AM +1100, Dave Chinner wrote:
> > > On Tue, Feb 27, 2024 at 05:07:30AM -0500, Kent Overstreet wrote:
> > > > AFAIK every filesystem allows concurrent direct writes, not just xfs,
> > > > it's _buffered_ writes that we care about here.
> > >
> > > We could do concurrent buffered writes in XFS - we would just use
> > > the same locking strategy as direct IO and fall back on folio locks
> > > for copy-in exclusion like ext4 does.
> >
> > ext4 code doesn't do that. it takes the inode lock in exclusive mode,
> > just like everyone else.
>
> Uhuh. ext4 does allow concurrent DIO writes. It's just much more
> constrained than XFS. See ext4_dio_write_checks().
>
> > > The real question is how much of userspace will that break, because
> > > of implicit assumptions that the kernel has always serialised
> > > buffered writes?
> >
> > What would break?
>
> Good question. If you don't know the answer, then you've got the
> same problem as I have. i.e. we don't know if concurrent
> applications that use buffered IO extensively (eg. postgres) assume
> data coherency because of the implicit serialisation occurring
> during buffered IO writes?
>
> > > > If we do a short write because of a page fault (despite previously
> > > > faulting in the userspace buffer), there is no way to completely prevent
> > > > torn writes an atomicity breakage; we could at least try a trylock on
> > > > the inode lock, I didn't do that here.
> > >
> > > As soon as we go for concurrent writes, we give up on any concept of
> > > atomicity of buffered writes (esp. w.r.t reads), so this really
> > > doesn't matter at all.
> >
> > We've already given up buffered write vs. read atomicity, have for a
> > long time - buffered read path takes no locks.
>
> We still have explicit buffered read() vs buffered write() atomicity
> in XFS via buffered reads taking the inode lock shared (see
> xfs_file_buffered_read()) because that's what POSIX says we should
> have.
>
> Essentially, we need to explicitly give POSIX the big finger and
> state that there are no atomicity guarantees given for write() calls
> of any size, nor are there any guarantees for data coherency for
> any overlapping concurrent buffered IO operations.
>

I have disabled read vs. write atomicity (out-of-tree) to make xfs behave
like the other filesystems ever since Jan added the invalidate_lock, and I
believe that the Meta kernel had done it that way even before.

> Those are things we haven't completely given up yet w.r.t. buffered
> IO, and enabling concurrent buffered writes will expose to users.
> So we need to have explicit policies for this and document them
> clearly in all the places that application developers might look
> for behavioural hints.

That's doable - I can try to do that.
What is your take regarding opt-in/opt-out of legacy behavior?
At the time, I proposed the POSIX_FADV_TORN_RW API [1]
to opt out of the legacy POSIX behavior, but I guess that an xfs mount
option would make more sense for consistent and clear semantics across
the fs - it is easier if all buffered IO to an inode behaves the same way.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-xfs/CAOQ4uxguwnx4AxXqp_zjg39ZUaTGJEM2wNUPnNdtiqV2Q9woqA@mail.gmail.com/
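
For reference, the proposal in [1] is an fadvise() hint, so opting a file
descriptor out would look something like the sketch below; POSIX_FADV_TORN_RW
is from the unmerged proposal, and the numeric value used here is purely
illustrative:

#include <fcntl.h>

#ifndef POSIX_FADV_TORN_RW
#define POSIX_FADV_TORN_RW 8   /* illustrative only, not a uapi value */
#endif

/* Opt this fd out of the legacy write-vs-read atomicity guarantee. */
static int allow_torn_buffered_io(int fd)
{
    return posix_fadvise(fd, 0, 0, POSIX_FADV_TORN_RW);
}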

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-28  7:48         ` [Lsf-pc] " Amir Goldstein
@ 2024-02-28 14:01           ` Chris Mason
  2024-02-29  0:25           ` Dave Chinner
  1 sibling, 0 replies; 87+ messages in thread
From: Chris Mason @ 2024-02-28 14:01 UTC (permalink / raw)
  To: Amir Goldstein, Dave Chinner
  Cc: Kent Overstreet, Pankaj Raghav, Jens Axboe, Chris Mason,
	Matthew Wilcox, Daniel Gomez, linux-mm, Luis Chamberlain,
	Johannes Weiner, linux-fsdevel, lsf-pc, Linus Torvalds,
	Christoph Hellwig, Josef Bacik, Jan Kara



On 2/28/24 2:48 AM, Amir Goldstein wrote:
> On Wed, Feb 28, 2024 at 12:42 AM Dave Chinner via Lsf-pc

>> Essentially, we need to explicitly give POSIX the big finger and
>> state that there are no atomicity guarantees given for write() calls
>> of any size, nor are there any guarantees for data coherency for
>> any overlapping concurrent buffered IO operations.
>>
> 
> I have disabled read vs. write atomicity (out-of-tree) to make xfs behave
> as the other fs ever since Jan has added the invalidate_lock and I believe
> that Meta kernel has done that way before.

Hmmm, you might be thinking of my patch to prevent kswapd from getting
stuck on XFS inode reclaim, but I don't think we've ever messed with
write concurrency.  I'm comfortable with the concurrency change in
general, but it's not somewhere I'd be excited about differing from
upstream.

Total tangent, but we only carry two XFS patches right now that aren't
upstream.  I dropped the inode reclaim patch; the problem stopped
showing up in our profiles, and the impacted workloads changed to
rocksdb for other reasons.

We flip XFS discards back to synchronous.  Async discards without any
kind of metering saturate drives when we do bulk deletes, leading to
latency spikes on reads and writes.  There's probably a class of flash
that can handle this, but we don't have it.

Unfortunately I also disable large folios on XFS.  They are corrupting
xarrays on our v5.19 kernel, with large folios from multiple files
interleaved together in the same file.  We'll try again with them on
v6.4 or maybe v6.8, but the repro needs thousands of machines making NFS
noises just to trigger one failure, and I won't be able to debug it
until I can make a more reasonable reproduction.

-chris

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-28  4:22             ` Matthew Wilcox
@ 2024-02-28 17:34               ` Kent Overstreet
  2024-02-28 18:04                 ` Matthew Wilcox
  0 siblings, 1 reply; 87+ messages in thread
From: Kent Overstreet @ 2024-02-28 17:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Dave Chinner, Luis Chamberlain, lsf-pc,
	linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Wed, Feb 28, 2024 at 04:22:32AM +0000, Matthew Wilcox wrote:
> On Wed, Feb 28, 2024 at 03:00:36AM +0000, Matthew Wilcox wrote:
> > On Tue, Feb 27, 2024 at 09:22:26PM -0500, Kent Overstreet wrote:
> > > Which does raise the question of if we've ever attempted to define a
> > > lock ordering on folios. I suspect not, since folio lock doesn't even
> > > seem to have lockdep support.
> > 
> > We even wrote it down!
> > 
> >                 /*
> >                  * To avoid deadlocks between range_cyclic writeback and callers
> >                  * that hold pages in PageWriteback to aggregate I/O until
> >                  * the writeback iteration finishes, we do not loop back to the
> >                  * start of the file.  Doing so causes a page lock/page
> >                  * writeback access order inversion - we should only ever lock
> >                  * multiple pages in ascending page->index order, and looping
> >                  * back to the start of the file violates that rule and causes
> >                  * deadlocks.
> >                  */
> > 
> > (I'll take the AR to put this somewhere better like in the folio_lock()
> > kernel-doc)
> 
> Um.  I already did.
> 
>  * Context: May sleep.  If you need to acquire the locks of two or
>  * more folios, they must be in order of ascending index, if they are
>  * in the same address_space.  If they are in different address_spaces,
>  * acquire the lock of the folio which belongs to the address_space which
>  * has the lowest address in memory first.
> 
> Where should I have put this information that you would have found it,
> if not in the kernel-doc for folio_lock()?

I should have seen that :)

But even better would be if we could get lockdep to support folio locks,
then we could define a lockdep comparison function. Of course, there's
no place to stick a lockdep map, but I think technically lockdep could
do everything it needs to do without state in the lock itself, if only
that code didn't make my eyes bleed whenever I have to dig into it...

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-28 17:34               ` Kent Overstreet
@ 2024-02-28 18:04                 ` Matthew Wilcox
  0 siblings, 0 replies; 87+ messages in thread
From: Matthew Wilcox @ 2024-02-28 18:04 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Linus Torvalds, Dave Chinner, Luis Chamberlain, lsf-pc,
	linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Wed, Feb 28, 2024 at 12:34:41PM -0500, Kent Overstreet wrote:
> But even better would be if we could get lockdep to support folio locks,
> then we could define a lockdep comparison function. Of course, there's
> no place to stick a lockdep map, but I think technically lockdep could
> do everything it needs to do without state in the lock itself, if only
> that code didn't make my eyes bleed whenever I have to dig into it...

I don't think lockdep can support folio locks.  They're taken in process
context and (for ->read_folio) released in interrupt context.  Also we
can return to userspace while holding folio locks (I believe ...)

DEPT claims to be able to handle them, but I haven't looked in detail.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27 22:46       ` Linus Torvalds
  2024-02-27 23:00         ` Linus Torvalds
  2024-02-28  2:22         ` Kent Overstreet
@ 2024-02-28 18:18         ` Kent Overstreet
  2024-02-28 19:09           ` Linus Torvalds
  2 siblings, 1 reply; 87+ messages in thread
From: Kent Overstreet @ 2024-02-28 18:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm,
	Daniel Gomez, Pankaj Raghav, Jens Axboe, Christoph Hellwig,
	Chris Mason, Johannes Weiner, Matthew Wilcox

On Tue, Feb 27, 2024 at 02:46:11PM -0800, Linus Torvalds wrote:
> On Tue, 27 Feb 2024 at 14:21, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > ext4 code doesn't do that. it takes the inode lock in exclusive mode,
> > just like everyone else.
> 
> Not for dio, it doesn't.
> 
> > > The real question is how much of userspace will that break, because
> > > of implicit assumptions that the kernel has always serialised
> > > buffered writes?
> >
> > What would break?
> 
> Well, at least in theory you could have concurrent overlapping writes
> of folio crossing records, and currently you do get the guarantee that
> one or the other record is written, but relying just on page locking
> would mean that you might get a mix of them at page boundaries.

I think we can keep that guarantee.

The tricky case was -EFAULT from copy_from_user_nofault(), where we have
to bail out, drop locks, re-fault in the user buffer - and redo the rest
of the write, this time holding the inode lock.

We can't guarantee that partial writes don't happen, but what we can do
is restart the write from the beginning, so the partial write gets
overwritten with a full atomic write.

This way after writes complete we'll never have weird torn writes left
around.
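
A rough sketch of that restart shape, with hypothetical
do_write_nofault()/do_write_locked() helpers standing in for the real
copy-in paths:

#include <linux/fs.h>
#include <linux/uio.h>

/*
 * Sketch only: on a mid-copy fault, rewind the iterator, fault the
 * user buffer back in, and redo the *whole* write under the inode
 * lock so the final state is one complete copy of the write rather
 * than a torn one.  do_write_nofault()/do_write_locked() are
 * hypothetical stand-ins.
 */
static ssize_t buffered_write_with_restart(struct inode *inode,
                                           struct iov_iter *iter, loff_t pos)
{
    size_t orig_count = iov_iter_count(iter);
    ssize_t ret;

    ret = do_write_nofault(inode, iter, pos);
    if (ret != -EFAULT)
        return ret;

    /* Undo whatever the optimistic pass consumed, then fault in. */
    iov_iter_revert(iter, orig_count - iov_iter_count(iter));
    if (fault_in_iov_iter_readable(iter, orig_count) == orig_count)
        return -EFAULT;        /* nothing could be faulted in */

    inode_lock(inode);
    ret = do_write_locked(inode, iter, pos);   /* restart from byte 0 */
    inode_unlock(inode);
    return ret;
}

The key property is that the locked retry starts over from byte 0, so
whatever the optimistic pass managed to copy is simply overwritten.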

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-28 18:18         ` Kent Overstreet
@ 2024-02-28 19:09           ` Linus Torvalds
  2024-02-28 19:29             ` Kent Overstreet
  0 siblings, 1 reply; 87+ messages in thread
From: Linus Torvalds @ 2024-02-28 19:09 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Dave Chinner, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm,
	Daniel Gomez, Pankaj Raghav, Jens Axboe, Christoph Hellwig,
	Chris Mason, Johannes Weiner, Matthew Wilcox

On Wed, 28 Feb 2024 at 10:18, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> I think we can keep that guarantee.
>
> The tricky case was -EFAULT from copy_from_user_nofault(), where we have
> to bail out, drop locks, re-fault in the user buffer - and redo the rest
> of the write, this time holding the inode lock.
>
> We can't guarantee that partial writes don't happen, but what we can do
> is restart the write from the beginning, so the partial write gets
> overwritten with a full atomic write.

I think that's a solution that is actually much worse than the thing
it is trying to solve.

Now a concurrent reader can actually see the data change twice or
more. Either because there's another writer that came in in between,
or because of threaded modifications to the source buffer in the first
writer.

So your solution actually makes for noticeably *worse* atomicity
guarantees, not better.

Not the solution. Not at all.

I do think the solution is to just take the inode lock exclusive (when
we have to modify the inode size or the suid/sgid) or shared (to
prevent concurrent i_size modifications), and leave it at that.

And  we should probably do a mount flag (for defaults) and an
open-time flag (for specific uses) to let people opt in to this
behavior.
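
A sketch of how such a policy could look in a write path; the inode locking
calls are the existing API, while the opt-in decision (mount flag or
open-time flag) is reduced to a hypothetical boolean, and a real
implementation would have to recheck i_size after taking the shared lock:

#include <linux/fs.h>

static bool write_needs_exclusive(struct kiocb *iocb, size_t count)
{
    struct inode *inode = file_inode(iocb->ki_filp);

    return iocb->ki_pos + count > i_size_read(inode) ||
           (inode->i_mode & (S_ISUID | S_ISGID));
}

static void buffered_write_lock(struct kiocb *iocb, size_t count,
                                bool opted_in)
{
    struct inode *inode = file_inode(iocb->ki_filp);

    if (!opted_in || write_needs_exclusive(iocb, count))
        inode_lock(inode);          /* size/suid changes: exclusive */
    else
        inode_lock_shared(inode);   /* plain overwrite: shared */
}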

            Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-28 19:09           ` Linus Torvalds
@ 2024-02-28 19:29             ` Kent Overstreet
  2024-02-28 20:17               ` Linus Torvalds
  0 siblings, 1 reply; 87+ messages in thread
From: Kent Overstreet @ 2024-02-28 19:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm,
	Daniel Gomez, Pankaj Raghav, Jens Axboe, Christoph Hellwig,
	Chris Mason, Johannes Weiner, Matthew Wilcox

On Wed, Feb 28, 2024 at 11:09:53AM -0800, Linus Torvalds wrote:
> On Wed, 28 Feb 2024 at 10:18, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > I think we can keep that guarantee.
> >
> > The tricky case was -EFAULT from copy_from_user_nofault(), where we have
> > to bail out, drop locks, re-fault in the user buffer - and redo the rest
> > of the write, this time holding the inode lock.
> >
> > We can't guarantee that partial writes don't happen, but what we can do
> > is restart the write from the beginning, so the partial write gets
> > overwritten with a full atomic write.
> 
> I think that's a solution that is actually much worse than the thing
> it is trying to solve.
> 
> Now a concurrent reader can actually see the data change twice or
> more. Either because there's another writer that came in in between,
> or because of threaded modifications to the source buffer in the first
> writer.

Yeah, that's a wrinkle.

But that requires a three way race to observe; a race between a reader
and multiple writers, and the reader doing multiple reads.

The more concerning situation to me would be if breaking write atomicity
means that we end up with data in the file that doesn't correspond to a
total ordering of writes; e.g. part of write a, then write b, then the
rest of write a overlaying part of write b.

Maybe that can't happen as long as writes are always happening in
ascending folio order?

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-28 19:29             ` Kent Overstreet
@ 2024-02-28 20:17               ` Linus Torvalds
  2024-02-28 23:21                 ` Kent Overstreet
  0 siblings, 1 reply; 87+ messages in thread
From: Linus Torvalds @ 2024-02-28 20:17 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Dave Chinner, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm,
	Daniel Gomez, Pankaj Raghav, Jens Axboe, Christoph Hellwig,
	Chris Mason, Johannes Weiner, Matthew Wilcox

On Wed, 28 Feb 2024 at 11:29, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> The more concerning situation to me would be if breaking write atomicity
> means that we end up with data in the file that doesn't correspond to a
> total ordering of writes; e.g. part of write a, then write b, then the
> rest of write a overlaying part of write b.
>
> Maybe that can't happen as long as writes are always happening in
> ascending folio order?

So that was what my suggestion about just overlapping one lock at a
time was about - we always do writes in ascending order, and
contiguously (even if the data source obviously isn't necessarily some
contiguous thing).

And I think that's actually fundamental and not something we're likely
to get out of. If you do non-contiguous writes, you'll always treat
them as separate things.

Then just the "lock the next folio before unlocking the previous one"
would already give some relevant guarantees, in that at least you
wouldn't get overlapping writes where the write data would be mixed
up.
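
In code, that "overlap one lock at a time" scheme is classic lock coupling;
a stripped-down sketch over an array of target folios (copy-in elided,
folios assumed to be in ascending ->index order):

#include <linux/pagemap.h>

/*
 * Hold the lock on folio N until the lock on folio N+1 has been taken,
 * so a second overlapping writer can never pass the first one and the
 * two writes cannot interleave at a folio boundary.
 */
static void write_folios_coupled(struct folio **folios, unsigned int nr)
{
    struct folio *prev = NULL;
    unsigned int i;

    for (i = 0; i < nr; i++) {
        folio_lock(folios[i]);
        if (prev)
            folio_unlock(prev);

        /* ... copy the part of the write covering folios[i] ... */

        prev = folios[i];
    }
    if (prev)
        folio_unlock(prev);
}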

So you'd get *some* ordering, and while I wouldn't call it "total
ordering" (and I think with readers not taking locks you can't get
that anyway because readers will *always* see partial writes), I think
it's much better than some re-write model.

However, the "lock consecutive pages as you go along" does still have
the issue of "you have to be able to take a page fault in the middle".

And right now that actually depends on us always dropping the page
lock in between iterations.

This issue is solvable - if you get a partial read while you hold a
page lock, you can always just see "are the pages I hold locks on not
up-to-date?" And decide to do the read yourself (and mark them
up-to-date). We don't do that right now because it's simpler not to,
but it's not conceptually a huge problem.

It *is* a practical problem, though.

For example, in generic_perform_write(), we've left page locking on
writes to the filesystems (ie it's done by
"->write_begin/->write_end"), so I think in reality that "don't
release the lock on folio N until after you've taken the lock on folio
N+1" isn't actually wonderful. It has some really nasty practical
issues.

And yes, those practical issues are because of historical mistakes
(some of them very much by yours truly). Having one single "page lock"
was probably one of those historical mistakes. If we use a different
bit for serializing page writers, the above problem would go away as
an issue.

ANYWAY.

At least with the current setup we literally depend on that "one page
at a time" behavior right now, and even XFS - which takes the inode
lock shared for reading - does *not* do it for reading a page during a
page fault for all these reasons.

XFS uses iomap_write_iter() instead of generic_perform_write(), but
the solution there is exactly the same, and the issue is fairly
fundamental (unless you want to always read in pages that you are
going to overwrite first).

This is also one of the (many) reasons I think the POSIX atomicity
model is complete garbage. I think the whole "reads are atomic with
respect to writes" is a feel-good bedtime story. It's simplistic, and
it's just not *real*, because it's basically not compatible with mmap.

So the whole POSIX atomicity story comes from a historical
implementation background and ignores mmap.

Fine, people can live in that kind of "read and write without DIO is
special" fairy tale and think that paper standards are more important
than sane semantics. But I just am not a fan of that.

So I am personally perfectly happy to say "POSIX atomicity is a stupid
paper standard that has no relevance for reality". The read side
*cannot* be atomic wrt the write side.

But at the same time, I obviously then care a _lot_ about actual
existing loads. I may not worry about some POSIX atomicity guarantees,
but I *do* worry about real loads.

And I don't think real loads actually care about concurrent
overlapping writes at all, but the "I don't think" may be another
wishful feel-good bedtime story that isn't based on reality ;)

              Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-28 20:17               ` Linus Torvalds
@ 2024-02-28 23:21                 ` Kent Overstreet
  0 siblings, 0 replies; 87+ messages in thread
From: Kent Overstreet @ 2024-02-28 23:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm,
	Daniel Gomez, Pankaj Raghav, Jens Axboe, Christoph Hellwig,
	Chris Mason, Johannes Weiner, Matthew Wilcox

On Wed, Feb 28, 2024 at 12:17:50PM -0800, Linus Torvalds wrote:
> On Wed, 28 Feb 2024 at 11:29, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > The more concerning situation to me would be if breaking write atomicity
> > means that we end up with data in the file that doesn't correspond to a
> > total ordering of writes; e.g. part of write a, then write b, then the
> > rest of write a overlaying part of write b.
> >
> > Maybe that can't happen as long as writes are always happening in
> > ascending folio order?
> 
> So that was what my suggestion about just overlapping one lock at a
> time was about - we always do writes in ascending order, and
> contiguously (even if the data source obviously isn't necessarily some
> contiguous thing).
> 
> And I think that's actually fundamental and not something we're likely
> to get out of. If you do non-contiguous writes, you'll always treat
> them as separate things.
> 
> Then just the "lock the next folio before unlocking the previous one"
> would already give some relevant guarantees, in that at least you
> wouldn't get overlapping writes where the write data would be mixed
> up.
> 
> So you'd get *some* ordering, and while I wouldn't call it "total
> ordering" (and I think with readers not taking locks you can't get
> that anyway because readers will *always* see partial writes), I think
> it's much better than some re-write model.

Well, the re-write model is for the case where we have to bail out, drop
locks and re-fault, and that's no different here. But that's a side
issue.

The "just take the next lock before dropping the previous lock" approach
- I agree that works just as well. I'm still trying to come up with a
satisfying proof as to why - but one writer can't pass another writer,
it's basically that.

> 
> However, the "lock consecutive pages as you go along" does still have
> the issue of "you have to be able to take a page fault in the middle".
> 
> And right now that actually depends on us always dropping the page
> lock in between iterations.
> 
> This issue is solvable - if you get a partial read while you hold a
> page lock, you can always just see "are the pages I hold locks on not
> up-to-date?" And decide to do the read yourself (and mark them
> up-to-date). We don't do that right now because it's simpler not to,
> but it's not conceptually a huge problem.

Hang on, I think you're talking about two different issues. The "oops,
copying from the _user_ buffer faulted" is the more annoying one here.
If the destination folio wasn't uptodate, we just read that inline, we
don't drop locks for that (at least my code doesn't; I'd have to have a
look at filemap.c and iomap for those versions).

There is another funny corner case where the destination folio wasn't
uptodate, and we decided we didn't need to read it in because we were
going to be fully overwriting it, but then we _didn't_ fully overwrite
it - then we have to revert that part of the write.

But if that happened it was because the user copy didn't copy as much
as we expected, i.e. it would've needed to take a page fault, so we have
to bail out, drop locks and refault anyways.

As an aside, the "copy_(to|from)_user() may recurse into arbitrary code
and take arbitrary locks" issue is some scary stuff, and I very much doubt
it's handled correctly everywhere. If I ever do the ioctl v2 proposal I
wrote up a while back, I think part of that should be double buffering
approximately at the syscall layer so that we're not doing that from
within rando driver/fs code holding rando locks; the 1% of cases that
care about performance can do something special, but most shouldn't.

> It *is* a practical problem, though.
> 
> For example, in generic_perform_write(), we've left page locking on
> writes to the filesystems (ie it's done by
> "->write_begin/->write_end"), so I think in reality that "don't
> release the lock on folio N until after you've taken the lock on folio
> N+1" isn't actually wonderful. It has some really nasty practical
> issues.

yeah, write_begin and write_end are... odd, to say the least.

> And yes, those practical issues are because of historical mistakes
> (some of them very much by yours truly). Having one single "page lock"
> was probably one of those historical mistakes. If we use a different
> bit for serializing page writers, the above problem would go away as
> an issue.

well, it's a difficult situation, a _lot_ of different code ends up
wanting to communicate state machine style through page flags.

but yeah, we're not clear on whether folio lock protects
 - the data within the folio?
 - is it used for signalling read completion? (sometimes!)
 - is it used for guarding against truncate/invalidate? There are
   multiple different truncate/invalidate paths, doing different
   things based on folio dirty/writeback and perhaps locked
 - does it protect other filesystem state hanging off folio->private?
   you might think so, but I came to the conclusion that that was a bad
   idea

> ANYWAY.
> 
> At least with the current setup we literally depend on that "one page
> at a time" behavior right now, and even XFS - which takes the inode
> lock shared for reading - does *not* do it for reading a page during a
> page fault for all these reasons.
> 
> XFS uses iomap_write_iter() instead of generic_perform_write(), but
> the solution there is exactly the same, and the issue is fairly
> fundamental (unless you want to always read in pages that you are
> going to overwrite first).
> 
> This is also one of the (many) reasons I think the POSIX atomicity
> model is complete garbage. I think the whole "reads are atomic with
> respect to writes" is a feel-good bedtime story. It's simplistic, and
> it's just not *real*, because it's basically not compatible with mmap.

oh, I wouldn't go that far.

I think the real problem is just undocumented behaviour that differs
across filesystems, and we've got too many filesystems doing their own
thing where their smart stuff should've been lifted up to the vfs (and
yes, I am a guilty party here too).

And another part of what makes this stuff hard is that it's always such
a goddamn hassle to extend userspace interfaces, even just to add a
couple flags. People have a real fear of that stuff, and when things are
proposed they end up getting bikeshedded to death, and then we _overdesign_
stuff so that we don't have to repeat that process... no. Make it easy
to add new userspace interfaces, do it in small bite-sized chunks, one
thing at a time, learn from our mistakes, deprecate and add new
revisions for the stuff that turned out to be crap.

...What we really want here is just some flags so that userspace can
specify what locking model they want. We're clearly capable of
implementing multiple locking models, and the real nightmare scenario is
some userspace program depending on stupidly subtle atomicity guarantees
that only one filesystem implements AND THERE'S NOTHING TO GREP FOR IN
THE CODE TO FIND IT.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27 17:58                                                     ` Paul E. McKenney
@ 2024-02-28 23:55                                                       ` Kent Overstreet
  2024-02-29 19:42                                                         ` Paul E. McKenney
  0 siblings, 1 reply; 87+ messages in thread
From: Kent Overstreet @ 2024-02-28 23:55 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Tue, Feb 27, 2024 at 09:58:48AM -0800, Paul E. McKenney wrote:
> On Tue, Feb 27, 2024 at 11:34:06AM -0500, Kent Overstreet wrote:
> > On Tue, Feb 27, 2024 at 08:21:29AM -0800, Paul E. McKenney wrote:
> > > On Tue, Feb 27, 2024 at 03:54:23PM +0000, Matthew Wilcox wrote:
> > > > On Tue, Feb 27, 2024 at 07:32:32AM -0800, Paul E. McKenney wrote:
> > > > > At a ridiculously high level, reclaim is looking for memory to free.
> > > > > Some read-only memory can often be dropped immediately on the grounds
> > > > > that its data can be read back in if needed.  Other memory can only be
> > > > > dropped after being written out, which involves a delay.  There are of
> > > > > course many other complications, but this will do for a start.
> > > > 
> > > > Hi Paul,
> > > > 
> > > > I appreciate the necessity of describing what's going on at a very high
> > > > level, but there's a wrinkle that I'm not sure you're aware of which
> > > > may substantially change your argument.
> > > > 
> > > > For anonymous memory, we do indeed wait until reclaim to start writing it
> > > > to swap.  That may or may not be the right approach given how anonymous
> > > > memory is used (and could be the topic of an interesting discussion
> > > > at LSFMM).
> > > > 
> > > > For file-backed memory, we do not write back memory in reclaim.  If it
> > > > has got to the point of calling ->writepage in vmscan, things have gone
> > > > horribly wrong to the point where calling ->writepage will make things
> > > > worse.  This is why we're currently removing ->writepage from every
> > > > filesystem (only ->writepages will remain).  Instead, the page cache
> > > > is written back much earlier, once we get to balance_dirty_pages().
> > > > That lets us write pages in filesystem-friendly ways instead of in MM
> > > > LRU order.
> > > 
> > > Thank you for the additional details.
> > > 
> > > But please allow me to further summarize the point of my prior email
> > > that seems to be getting lost:
> > > 
> > > 1.	RCU already does significant work prodding grace periods.
> > > 
> > > 2.	There is no reasonable way to provide estimates of the
> > > 	memory sent to RCU via call_rcu(), and in many cases
> > > 	the bulk of the waiting memory will be call_rcu() memory.
> > > 
> > > Therefore, if we cannot come up with a heuristic that does not need to
> > > know the bytes of memory waiting, we are stuck anyway.
> > 
> > That is a completely asinine argument.
> 
> Huh.  Anything else you need to get off your chest?
> 
> On the off-chance it is unclear, I do disagree with your assessment.
> 
> > > So perhaps the proper heuristic for RCU speeding things up is simply
> > > "Hey RCU, we are in reclaim!".
> > 
> > Because that's the wrong heuristic. There are important workloads for
> > which  we're _always_ in reclaim, but as long as RCU grace periods are
> > happening at some steady rate, the amount of memory stranded will be
> > bounded and there's no reason to expedite grace periods.
> 
> RCU is in fact designed to handle heavy load, and contains a number of
> mechanisms to operate more efficiently at higher load than at lower load.
> It also contains mechanisms to expedite grace periods under heavy load.
> Some of which I already described in earlier emails on this thread.

yeah, the synchronize_rcu_expedited() sounds like exactly what we need
here, when memory reclaim notices too much memory is stranded

> > If we start RCU freeing all pagecache folios we're going to be cycling
> > memory through RCU freeing at the rate of gigabytes per second, tens of
> > gigabytes per second on high end systems.
> 
> The load on RCU would be measured in terms of requests (kfree_rcu() and
> friends) per unit time per CPU, not in terms of gigabytes per unit time.
> Of course, the amount of memory per unit time might be an issue for
> whatever allocators you are using, and as Matthew has often pointed out,
> the deferred reclamation incurs additional cache misses.

So what I'm saying is that in the absence of something noticing
excessive memory being stranded and asking for an expedited grace
period, the only bound on the amount of memory being stranded will be
how often RCU grace periods expire in the absence of anyone asking for
them - that was my question to you. I'm not at all knowledgeable on RCU
internals but I gather it varies with things like dynticks and whether
or not userspace is calling into the kernel?

"gigabytes per second" - if userspace is doing big sequential streaming
reads that don't fit into cache, we'll be evicting pagecache as quickly
as we can read it in, so we should only be limited by your SSD
bandwidth.

> And rcutorture really does do forward-progress testing on vanilla RCU
> that features tight in-kernel loops doing call_rcu() without bounds on
> memory in flight, and RCU routinely survives this in a VM that is given
> only 512MB of memory.  In fact, any failure to survive this is considered
> a bug to be fixed.

Are you saying there's already feedback between memory reclaim and RCU?

> So I suspect RCU is quite capable of handling this load.  But how many
> additional kfree_rcu() calls are you anticipating per unit time per CPU?
> For example, could we simply measure the rate at which pagecache folios
> are currently being freed under heavy load?  Given that information,
> we could just try it out.

It's not the load I'm concerned about, or the number of call_rcu()
calls, I have no doubt that RCU will cope with that just fine.

But I do think that we need an additional feedback mechanism here. When
we're doing big streaming sequential buffered IO, and lots of memory is
cycled in and out of the pagecache, we have a couple things we want to
avoid:

The main thing is that we don't want the amount of memory stranded
waiting for RCU to grow unbounded and shove everything out of the
caches; if you're currently just concerned about _deadlock_, that is
likely insufficient here - we're also concerned about maintaining good
steady performance under load.

We don't want memory reclaim to be trying harder and harder when the
correct thing to do is a synchronize_rcu_expedited().

We _also_ don't want to be hammering on RCU asking for expedited grace
periods unnecessarily when the number of pending callbacks is high, but
they're all for unrelated stuff - expedited RCU grace periods aren't
free either!

Does that help clarify things?
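
As a very rough sketch of the kind of feedback described above, assuming a
hypothetical counter of bytes handed to RCU for freeing and an arbitrary
threshold:

#include <linux/atomic.h>
#include <linux/rcupdate.h>

/*
 * Hypothetical: rcu_stranded_bytes would be incremented when pagecache
 * memory is handed to call_rcu()/kfree_rcu() and decremented from the
 * RCU callbacks as it is actually freed (not shown).
 */
static atomic_long_t rcu_stranded_bytes;

#define RCU_STRANDED_LIMIT    (64UL << 20)    /* 64MB, made up */

/* Called from reclaim when it is struggling to make progress. */
static void maybe_expedite_rcu_for_reclaim(void)
{
    if (atomic_long_read(&rcu_stranded_bytes) < RCU_STRANDED_LIMIT)
        return;    /* stranded memory is bounded; leave RCU alone */

    /*
     * Too much memory is waiting on a grace period: force one now
     * instead of making reclaim try ever harder on what is left.
     */
    synchronize_rcu_expedited();
}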

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-28  7:48         ` [Lsf-pc] " Amir Goldstein
  2024-02-28 14:01           ` Chris Mason
@ 2024-02-29  0:25           ` Dave Chinner
  2024-02-29  0:57             ` Kent Overstreet
  1 sibling, 1 reply; 87+ messages in thread
From: Dave Chinner @ 2024-02-29  0:25 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Kent Overstreet, Pankaj Raghav, Jens Axboe, Chris Mason,
	Matthew Wilcox, Daniel Gomez, linux-mm, Luis Chamberlain,
	Johannes Weiner, linux-fsdevel, lsf-pc, Linus Torvalds,
	Christoph Hellwig, Josef Bacik, Jan Kara

On Wed, Feb 28, 2024 at 09:48:46AM +0200, Amir Goldstein wrote:
> On Wed, Feb 28, 2024 at 12:42 AM Dave Chinner via Lsf-pc
> <lsf-pc@lists.linux-foundation.org> wrote:
> >
> > On Tue, Feb 27, 2024 at 05:21:20PM -0500, Kent Overstreet wrote:
> > > On Wed, Feb 28, 2024 at 09:13:05AM +1100, Dave Chinner wrote:
> > > > On Tue, Feb 27, 2024 at 05:07:30AM -0500, Kent Overstreet wrote:
> > > > > AFAIK every filesystem allows concurrent direct writes, not just xfs,
> > > > > it's _buffered_ writes that we care about here.
> > > >
> > > > We could do concurrent buffered writes in XFS - we would just use
> > > > the same locking strategy as direct IO and fall back on folio locks
> > > > for copy-in exclusion like ext4 does.
> > >
> > > ext4 code doesn't do that. it takes the inode lock in exclusive mode,
> > > just like everyone else.
> >
> > Uhuh. ext4 does allow concurrent DIO writes. It's just much more
> > constrained than XFS. See ext4_dio_write_checks().
> >
> > > > The real question is how much of userspace will that break, because
> > > > of implicit assumptions that the kernel has always serialised
> > > > buffered writes?
> > >
> > > What would break?
> >
> > Good question. If you don't know the answer, then you've got the
> > same problem as I have. i.e. we don't know if concurrent
> > applications that use buffered IO extensively (eg. postgres) assume
> > data coherency because of the implicit serialisation occurring
> > during buffered IO writes?
> >
> > > > > If we do a short write because of a page fault (despite previously
> > > > > faulting in the userspace buffer), there is no way to completely prevent
> > > > > torn writes an atomicity breakage; we could at least try a trylock on
> > > > > the inode lock, I didn't do that here.
> > > >
> > > > As soon as we go for concurrent writes, we give up on any concept of
> > > > atomicity of buffered writes (esp. w.r.t reads), so this really
> > > > doesn't matter at all.
> > >
> > > We've already given up buffered write vs. read atomicity, have for a
> > > long time - buffered read path takes no locks.
> >
> > We still have explicit buffered read() vs buffered write() atomicity
> > in XFS via buffered reads taking the inode lock shared (see
> > xfs_file_buffered_read()) because that's what POSIX says we should
> > have.
> >
> > Essentially, we need to explicitly give POSIX the big finger and
> > state that there are no atomicity guarantees given for write() calls
> > of any size, nor are there any guarantees for data coherency for
> > any overlapping concurrent buffered IO operations.
> >
> 
> I have disabled read vs. write atomicity (out-of-tree) to make xfs behave
> as the other fs ever since Jan has added the invalidate_lock and I believe
> that Meta kernel has done that way before.
> 
> > Those are things we haven't completely given up yet w.r.t. buffered
> > IO, and enabling concurrent buffered writes will expose to users.
> > So we need to have explicit policies for this and document them
> > clearly in all the places that application developers might look
> > for behavioural hints.
> 
> That's doable - I can try to do that.
> What is your take regarding opt-in/opt-out of legacy behavior?

Screw the legacy code, don't even make it an option. No-one should
be relying on large buffered writes being atomic anymore, and with
high order folios in the page cache most small buffered writes are
going to be atomic w.r.t. both reads and writes anyway.

> At the time, I have proposed POSIX_FADV_TORN_RW API [1]
> to opt-out of the legacy POSIX behavior, but I guess that an xfs mount
> option would make more sense for consistent and clear semantics across
> the fs - it is easier if all buffered IO to inode behaved the same way.

No mount options, just change the behaviour. Applications already
have to avoid concurrent overlapping buffered reads and writes if
they care about data integrity and coherency, so making buffered
writes concurrent doesn't change anything.

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-29  0:25           ` Dave Chinner
@ 2024-02-29  0:57             ` Kent Overstreet
  2024-03-04  0:46               ` Dave Chinner
  0 siblings, 1 reply; 87+ messages in thread
From: Kent Overstreet @ 2024-02-29  0:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Amir Goldstein, Pankaj Raghav, Jens Axboe, Chris Mason,
	Matthew Wilcox, Daniel Gomez, linux-mm, Luis Chamberlain,
	Johannes Weiner, linux-fsdevel, lsf-pc, Linus Torvalds,
	Christoph Hellwig, Josef Bacik, Jan Kara

On Thu, Feb 29, 2024 at 11:25:33AM +1100, Dave Chinner wrote:
> On Wed, Feb 28, 2024 at 09:48:46AM +0200, Amir Goldstein wrote:
> > On Wed, Feb 28, 2024 at 12:42 AM Dave Chinner via Lsf-pc
> > <lsf-pc@lists.linux-foundation.org> wrote:
> > >
> > > On Tue, Feb 27, 2024 at 05:21:20PM -0500, Kent Overstreet wrote:
> > > > On Wed, Feb 28, 2024 at 09:13:05AM +1100, Dave Chinner wrote:
> > > > > On Tue, Feb 27, 2024 at 05:07:30AM -0500, Kent Overstreet wrote:
> > > > > > AFAIK every filesystem allows concurrent direct writes, not just xfs,
> > > > > > it's _buffered_ writes that we care about here.
> > > > >
> > > > > We could do concurrent buffered writes in XFS - we would just use
> > > > > the same locking strategy as direct IO and fall back on folio locks
> > > > > for copy-in exclusion like ext4 does.
> > > >
> > > > ext4 code doesn't do that. it takes the inode lock in exclusive mode,
> > > > just like everyone else.
> > >
> > > Uhuh. ext4 does allow concurrent DIO writes. It's just much more
> > > constrained than XFS. See ext4_dio_write_checks().
> > >
> > > > > The real question is how much of userspace will that break, because
> > > > > of implicit assumptions that the kernel has always serialised
> > > > > buffered writes?
> > > >
> > > > What would break?
> > >
> > > Good question. If you don't know the answer, then you've got the
> > > same problem as I have. i.e. we don't know if concurrent
> > > applications that use buffered IO extensively (eg. postgres) assume
> > > data coherency because of the implicit serialisation occurring
> > > during buffered IO writes?
> > >
> > > > > > If we do a short write because of a page fault (despite previously
> > > > > > faulting in the userspace buffer), there is no way to completely prevent
> > > > > > torn writes an atomicity breakage; we could at least try a trylock on
> > > > > > the inode lock, I didn't do that here.
> > > > >
> > > > > As soon as we go for concurrent writes, we give up on any concept of
> > > > > atomicity of buffered writes (esp. w.r.t reads), so this really
> > > > > doesn't matter at all.
> > > >
> > > > We've already given up buffered write vs. read atomicity, have for a
> > > > long time - buffered read path takes no locks.
> > >
> > > We still have explicit buffered read() vs buffered write() atomicity
> > > in XFS via buffered reads taking the inode lock shared (see
> > > xfs_file_buffered_read()) because that's what POSIX says we should
> > > have.
> > >
> > > Essentially, we need to explicitly give POSIX the big finger and
> > > state that there are no atomicity guarantees given for write() calls
> > > of any size, nor are there any guarantees for data coherency for
> > > any overlapping concurrent buffered IO operations.
> > >
> > 
> > I have disabled read vs. write atomicity (out-of-tree) to make xfs behave
> > as the other fs ever since Jan has added the invalidate_lock and I believe
> > that Meta kernel has done that way before.
> > 
> > > Those are things we haven't completely given up yet w.r.t. buffered
> > > IO, and enabling concurrent buffered writes will expose to users.
> > > So we need to have explicit policies for this and document them
> > > clearly in all the places that application developers might look
> > > for behavioural hints.
> > 
> > That's doable - I can try to do that.
> > What is your take regarding opt-in/opt-out of legacy behavior?
> 
> Screw the legacy code, don't even make it an option. No-one should
> be relying on large buffered writes being atomic anymore, and with
> high order folios in the page cache most small buffered writes are
> going to be atomic w.r.t. both reads and writes anyway.

That's a new take...

> 
> > At the time, I have proposed POSIX_FADV_TORN_RW API [1]
> > to opt-out of the legacy POSIX behavior, but I guess that an xfs mount
> > option would make more sense for consistent and clear semantics across
> > the fs - it is easier if all buffered IO to inode behaved the same way.
> 
> No mount options, just change the behaviour. Applications already
> have to avoid concurrent overlapping buffered reads and writes if
> they care about data integrity and coherency, so making buffered
> writes concurrent doesn't change anything.

Honestly - no.

Userspace would really like to see some sort of definition for this kind
of behaviour, and if we just change things underneath them without
telling anyone, _that's a dick move_.

POSIX_FADV_TORN_RW is a terrible name, though.

And fadvise() is the wrong API for this because it applies to ranges;
this should be an open flag or an fcntl.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-28 23:55                                                       ` Kent Overstreet
@ 2024-02-29 19:42                                                         ` Paul E. McKenney
  2024-02-29 20:51                                                           ` Kent Overstreet
  0 siblings, 1 reply; 87+ messages in thread
From: Paul E. McKenney @ 2024-02-29 19:42 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Wed, Feb 28, 2024 at 06:55:12PM -0500, Kent Overstreet wrote:
> On Tue, Feb 27, 2024 at 09:58:48AM -0800, Paul E. McKenney wrote:
> > On Tue, Feb 27, 2024 at 11:34:06AM -0500, Kent Overstreet wrote:
> > > On Tue, Feb 27, 2024 at 08:21:29AM -0800, Paul E. McKenney wrote:
> > > > On Tue, Feb 27, 2024 at 03:54:23PM +0000, Matthew Wilcox wrote:
> > > > > On Tue, Feb 27, 2024 at 07:32:32AM -0800, Paul E. McKenney wrote:
> > > > > > At a ridiculously high level, reclaim is looking for memory to free.
> > > > > > Some read-only memory can often be dropped immediately on the grounds
> > > > > > that its data can be read back in if needed.  Other memory can only be
> > > > > > dropped after being written out, which involves a delay.  There are of
> > > > > > course many other complications, but this will do for a start.
> > > > > 
> > > > > Hi Paul,
> > > > > 
> > > > > I appreciate the necessity of describing what's going on at a very high
> > > > > level, but there's a wrinkle that I'm not sure you're aware of which
> > > > > may substantially change your argument.
> > > > > 
> > > > > For anonymous memory, we do indeed wait until reclaim to start writing it
> > > > > to swap.  That may or may not be the right approach given how anonymous
> > > > > memory is used (and could be the topic of an interesting discussion
> > > > > at LSFMM).
> > > > > 
> > > > > For file-backed memory, we do not write back memory in reclaim.  If it
> > > > > has got to the point of calling ->writepage in vmscan, things have gone
> > > > > horribly wrong to the point where calling ->writepage will make things
> > > > > worse.  This is why we're currently removing ->writepage from every
> > > > > filesystem (only ->writepages will remain).  Instead, the page cache
> > > > > is written back much earlier, once we get to balance_dirty_pages().
> > > > > That lets us write pages in filesystem-friendly ways instead of in MM
> > > > > LRU order.
> > > > 
> > > > Thank you for the additional details.
> > > > 
> > > > But please allow me to further summarize the point of my prior email
> > > > that seems to be getting lost:
> > > > 
> > > > 1.	RCU already does significant work prodding grace periods.
> > > > 
> > > > 2.	There is no reasonable way to provide estimates of the
> > > > 	memory sent to RCU via call_rcu(), and in many cases
> > > > 	the bulk of the waiting memory will be call_rcu() memory.
> > > > 
> > > > Therefore, if we cannot come up with a heuristic that does not need to
> > > > know the bytes of memory waiting, we are stuck anyway.
> > > 
> > > That is a completely asinine argument.
> > 
> > Huh.  Anything else you need to get off your chest?
> > 
> > On the off-chance it is unclear, I do disagree with your assessment.
> > 
> > > > So perhaps the proper heuristic for RCU speeding things up is simply
> > > > "Hey RCU, we are in reclaim!".
> > > 
> > > Because that's the wrong heuristic. There are important workloads for
> > > which  we're _always_ in reclaim, but as long as RCU grace periods are
> > > happening at some steady rate, the amount of memory stranded will be
> > > bounded and there's no reason to expedite grace periods.
> > 
> > RCU is in fact designed to handle heavy load, and contains a number of
> > mechanisms to operate more efficiently at higher load than at lower load.
> > It also contains mechanisms to expedite grace periods under heavy load.
> > Some of which I already described in earlier emails on this thread.
> 
> yeah, the synchronize_rcu_expedited() sounds like exactly what we need
> here, when memory reclaim notices too much memory is stranded

I was talking about things that normal grace periods already do to speed
things up when needed, but it might well be the case that this use
case needs to use synchronize_rcu_expedited().  But this function
does have some serious side effects, including lots of IPIs and (in
preemptible kernels) priority boosting of any task that has ever
been preempted within its current RCU read-side critical section.
Plus synchronize_rcu_expedited() cannot help if the problem is overly
long readers.

But one might imagine a solution that used kfree_rcu() and
rcu_barrier() to detect any RCU slowness, which would trigger use of
synchronize_rcu_expedited().  Except that this is harder than it
might immediately appear.

So first, let's find out what the situation really requires.

> > > If we start RCU freeing all pagecache folios we're going to be cycling
> > > memory through RCU freeing at the rate of gigabytes per second, tens of
> > > gigabytes per second on high end systems.
> > 
> > The load on RCU would be measured in terms of requests (kfree_rcu() and
> > friends) per unit time per CPU, not in terms of gigabytes per unit time.
> > Of course, the amount of memory per unit time might be an issue for
> > whatever allocators you are using, and as Matthew has often pointed out,
> > the deferred reclamation incurs additional cache misses.
> 
> So what I'm saying is that in the absence of something noticing
> excessive memory being stranded and asking for an expedited grace
> period, the only bound on the amount of memory being stranded will be
> how often RCU grace periods expire in the absence of anyone asking for
> them - that was my question to you. I'm not at all knowledgeable on RCU
> internals but I gather it varies with things like dynticks and whether
> or not userspace is calling into the kernel?

It does vary based on quite a few things.  Taking as an example your
dynticks point, a nohz_full CPU does not restart the tick upon entering
the kernel.  Which is what you want for short system calls because
enabling and disabling the tick can be expensive.  But this causes
RCU trouble if a nohz_full CPU executes a long-running system call,
so RCU will force the tick on should such a CPU refuse to respond to
RCU quickly enough.

But the typical idle-system normal grace-period duration is a couple of
tens of milliseconds.  Except of course that if there are one-second-long
readers, you will see one-second-long grace periods.

> "gigabytes per second" - if userspace is doing big sequential streaming
> reads that don't fit into cache, we'll be evicting pagecache as quickly
> as we can read it in, so we should only be limited by your SSD
> bandwidth.

Understood.

But why not simply replace the current freeing with kfree_rcu() or
similar and see what really happens?  More on this further down...

> > And rcutorture really does do forward-progress testing on vanilla RCU
> > that features tight in-kernel loops doing call_rcu() without bounds on
> > memory in flight, and RCU routinely survives this in a VM that is given
> > only 512MB of memory.  In fact, any failure to survive this is considered
> > a bug to be fixed.
> 
> Are you saying there's already feedback between memory reclaim and RCU?

For the callback flooding in the torture test, yes.  There is a
register_oom_notifier() function, and if that function is called during
the callback flooding, that is a test failure.

There used to be a shrinker in the RCU implementation itself, but this
proved unhelpful and was therefore removed.  As in RCU worked fine without
it given the workloads of that time.  Who knows?  Maybe it needs to be
added back in.

> > So I suspect RCU is quite capable of handling this load.  But how many
> > additional kfree_rcu() calls are you anticipating per unit time per CPU?
> > For example, could we simply measure the rate at which pagecache folios
> > are currently being freed under heavy load?  Given that information,
> > we could just try it out.
> 
> It's not the load I'm concerned about, or the number of call_rcu()
> calls, I have no doubt that RCU will cope with that just fine.
> 
> But I do think that we need an additional feedback mechanism here. When
> we're doing big streaming sequential buffered IO, and lots of memory is
> cycled in and out of the pagecache, we have a couple things we want to
> avoid:
> 
> The main thing is that we don't want the amount of memory stranded
> waiting for RCU to grow unbounded and shove everything out of the
> caches; if you're currently just concerned about _deadlock_ that is
> likely insufficient here, we're also concerned about maintaining good
> steady performance under load
> 
> We don't want memory reclaim to be trying harder and harder when the
> correct thing to do is a synchronize_rcu_expedited().
> 
> We _also_ don't want to be hammering on RCU asking for expedited grace
> periods unnecessarily when the number of pending callbacks is high, but
> they're all for unrelated stuff - expedited RCU grace periods aren't
> free either!
> 
> Does that help clarify things?

Yes, but...

The key point is that no one knows enough about this problem to dive at
all deeply into solution space.  From what you have described thus far,
the solution might involve one or more of the following options:

o	We do nothing, because RCU just handles it.  (Hey, I can dream,
	can't I?)

o	We do a few modifications to RCU's force-quiescent state code.
	This needs to be revisited for lazy preemption anyway, so this
	is a good time to take a look.

o	We do a few modifications to RCU's force-quiescent state code,
	but with the addition of communication of some sort from reclaim.
	This appears to be your current focus.

o	It might be that reclaim needs to use SRCU instead of RCU.
	This would insulate reclaim from other subsystems' overly long
	readers, but at the expense of smp_mb() overhead when entering
	and exiting a reader.  Also, unlike RCU, SRCU has no reasonable
	way of encouraging delayed readers (for example, delayed due
	to preemption).

o	It might be that reclaim needs its own specialized flavor of RCU,
	as did tracing (Tasks RCU) and BPF (Tasks Trace RCU).  Plus there
	is Steve Rostedt's additional flavor for tracing, which is now
	Tasks RCU Rude.  I doubt that any of these three are what you
	want, but let's get some hard data so that we can find out.

o	It might be that reclaim needs some synchronization mechanism
	other than RCU.

We really need some hard data on how this thing affects RCU.  Again,
could we do the kfree()->kfree_rcu() change (and similar change,
depending on what memory allocator is used) and then run some tests to
see how RCU reacts?

If RCU does not react all that well, then we will also need information on
what the RCU readers look like, for example, must they block, must they
be preemptible, and about how long do they run for?  Best if they never
block, need not be preemptible, and run for very short periods of time,
but they are what they are.  Perhaps Matthew Wilcox has some thoughts
on this.

Given that information, we can make at least a semi-informed decision
as to which (if any) of the six possibilities listed above is most
appropriate.

Does that help to communicate my position on this topic?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-29 19:42                                                         ` Paul E. McKenney
@ 2024-02-29 20:51                                                           ` Kent Overstreet
  2024-03-05  2:19                                                             ` Paul E. McKenney
  0 siblings, 1 reply; 87+ messages in thread
From: Kent Overstreet @ 2024-02-29 20:51 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Thu, Feb 29, 2024 at 11:42:53AM -0800, Paul E. McKenney wrote:
> On Wed, Feb 28, 2024 at 06:55:12PM -0500, Kent Overstreet wrote:
> > So what I'm saying is that in the absence of something noticing
> > excessive memory being stranded and asking for an expedited grace
> > period, the only bounds on the amount of memory being stranded will be
> > how often RCU grace periods expire in the absence of anyone asking for
> > them - that was my question to you. I'm not at all knowledgeable on RCU
> > internals but I gather it varies with things like dynticks and whether
> > or not userspace is calling into the kernel?
> 
> It does vary based on quite a few things.  Taking as an example your
> dynticks point, a nohz_full CPU does not restart the tick upon entering
> the kernel.  Which is what you want for short system calls because
> enabling and disabling the tick can be expensive.  But this causes
> RCU trouble if a nohz_full CPU executes a long-running system call,
> so RCU will force the tick on should such a CPU refuse to respond to
> RCU quickly enough.

Makes sense - so synchronize_rcu_expedited() has to check that and issue
an IPI?

> But the typical idle-system normal grace-period duration is a couple of
> tens of milliseconds.  Except of course that if there are one-second-long
> readers, you will see one-second-long grace periods.
> 
> > "gigabytes per second" - if userspace is doing big sequential streaming
> > reads that don't fit into cache, we'll be evicting pagecache as quickly
> > as we can read it in, so we should only be limited by your SSD
> > bandwidth.
> 
> Understood.
> 
> But why not simply replace the current freeing with kfree_rcu() or
> similar and see what really happens?  More further down...

Well, mainly because I've been doing filesystem and storage work for 15
years and these kinds of problems are my bread and butter; in filesystem
land (especially bcachefs, but really any modern journalling
filesystem), there are a _lot_ of background reclaim tasks not unlike
this.

So I'm speaking from experience here; the very first thing we're going
to want when we do this for page cache pages and start analyzing system
behaviour and performance is some basic numbers as to what's going on,
and here it's practically no trouble at all to do it right.

> > > And rcutorture really does do forward-progress testing on vanilla RCU
> > > that features tight in-kernel loops doing call_rcu() without bounds on
> > > memory in flight, and RCU routinely survives this in a VM that is given
> > > only 512MB of memory.  In fact, any failure to survive this is considered
> > > a bug to be fixed.
> > 
> > Are you saying there's already feedback between memory reclaim and RCU?
> 
> For the callback flooding in the torture test, yes.  There is a
> register_oom_notifier() function, and if that function is called during
> the callback flooding, that is a test failure.

Based on number of outstanding callbacks?

> There used to be a shrinker in the RCU implementation itself, but this
> proved unhelpful and was therefore removed.  As in RCU worked fine without
> it given the workloads of that time.  Who knows?  Maybe it needs to be
> added back in.

Why was it unhelpful? Were you having problems with the approach, or
just didn't seem to be useful?

> The key point is that no one knows enough about this problem to dive at
> all deeply into solution space.  From what you have described thus far,
> the solution might involve one or more of the following options:
> 
> o	We do nothing, because RCU just handles it.  (Hey, I can dream,
> 	can't I?)
> 
> o	We do a few modifications to RCU's force-quiescent state code.
> 	This needs to be revisited for lazy preemption anyway, so this
> 	is a good time to take a look.
> 
> o	We do a few modifications to RCU's force-quiescent state code,
> 	but with the addition of communication of some sort from reclaim.
> 	This appears to be your current focus.

Not exactly (but I'd like to hear what you have in mind); I'm
envisioning this driven by reclaim itself.

> o	It might be that reclaim needs to use SRCU instead of RCU.
> 	This would insulate reclaim from other subsystems' overly long
> 	readers, but at the expense of smp_mb() overhead when entering
> 	and exiting a reader.  Also, unlike RCU, SRCU has no reasonable
> 	way of encouraging delayed readers (for example, delayed due
> 	to preemption).

Yeah, I don't think SRCU is quite what we want here.

> o	It might be that reclaim needs its own specialized flavor of RCU,
> 	as did tracing (Tasks RCU) and BPF (Tasks Trace RCU).  Plus there
> 	is Steve Rostedt's additional flavor for tracing, which is now
> 	Tasks RCU Rude.  I doubt that any of these three are what you
> 	want, but let's get some hard data so that we can find out.
> 
> o	It might be that reclaim needs some synchronization mechanism
> 	other than RCU.
> 
> We really need some hard data on how this thing affects RCU.  Again,
> could we do the kfree()->kfree_rcu() change (and similar change,
> depending on what memory allocator is used) and then run some tests to
> see how RCU reacts?

Well, like I mentioned previously - the "hard data" you say we need is
exactly what I was saying we need to collect as a first order of
business - we need to know the amount of memory stranded by RCU, and we
need to be able to watch that number as we run through different
buffered IO workloads.

Sure, we could do this as a special case for just pagecache page
freeing; but it looks like zero extra work to just do it right and do it
for kfree_rcu() in general; and I've already talked about how minimal
the performance cost is.
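
Concretely, the sort of accounting I mean is no more than a per-CPU
byte counter, bumped when an object is handed to kfree_rcu() and
dropped when the callback finally frees it. The sketch below is
illustrative only (the names are made up and this is not how
kfree_rcu() batches things internally), but it shows how minimal the
bookkeeping is:

#include <linux/cpumask.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(long, rcu_stranded_bytes);

/* Called when an object is queued for deferred freeing. */
static inline void rcu_stranded_queue(size_t bytes)
{
	this_cpu_add(rcu_stranded_bytes, bytes);
}

/* Called from the callback when the memory is actually freed. */
static inline void rcu_stranded_freed(size_t bytes)
{
	this_cpu_sub(rcu_stranded_bytes, bytes);
}

/*
 * A debugfs or procfs file can then sum the per-CPU counters so we can
 * watch the stranded total while running buffered IO workloads.
 */
static long rcu_stranded_total(void)
{
	long total = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		total += per_cpu(rcu_stranded_bytes, cpu);
	return total;
}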

And more broadly, we have real gaps when it comes to OOM debugging and
being able to report on who's responsible for what memory and how much,
and that's something I want to close. I've got a shrinker debugging
patch that I really need to get back to and shove in, and really in that
area I need to go further and add byte counters to shrinker reporting.

IOW, this isn't just an RCU topic, this is a memory reclaim topic, and
lots of us filesystem developers have... things to say on the subject of
memory reclaim. Being able to get proper information so that we can
debug stuff has been a major point of contention; I hope you can
try to appreciate where we're coming from.

> If RCU does not react all that well, then we will also need information on
> what the RCU readers look like, for example, must they block, must they
> be preemptible, and about how long do they run for?  Best if they never
> block, need not be preemptible, and run for very short periods of time,
> but they are what they are.  Perhaps Matthew Wilcox has some thoughts
> on this.

We'll be using this in filemap_read() (and I've worked on that code more
than most); we're already using rcu_read_lock() there for the xarray
lookup, which is partly why RCU (the regular sort) is a natural fit for
protecting the data itself.

This will move the copy_to_user_nofault() under rcu_read_lock(), so
we're not blocking or faulting or preempting or anything like that. Just
doing a (potentially big) copy to userspace.
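
To sketch the shape of it (this is not the actual patch; the folio
lookup helper here is a made-up name standing in for the existing
xarray walk, while kmap_local_folio(), copy_to_user_nofault() and the
RCU calls are the real interfaces):

#include <linux/highmem.h>
#include <linux/pagemap.h>
#include <linux/rcupdate.h>
#include <linux/uaccess.h>

static ssize_t read_folio_under_rcu(struct address_space *mapping,
				    pgoff_t index, size_t offset,
				    char __user *buf, size_t len)
{
	struct folio *folio;
	void *kaddr;
	long ret;

	rcu_read_lock();
	/* Hypothetical: the xarray lookup we already do under RCU. */
	folio = filemap_lookup_folio_rcu(mapping, index);
	if (!folio) {
		rcu_read_unlock();
		return -EAGAIN;		/* fall back to the slow path */
	}

	kaddr = kmap_local_folio(folio, offset);
	/* Non-faulting copy: the reader never blocks or takes a fault. */
	ret = copy_to_user_nofault(buf, kaddr, len);
	kunmap_local(kaddr);
	rcu_read_unlock();

	/* If the user page wasn't resident, fall back to the faulting path. */
	return ret ? -EAGAIN : len;
}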

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-29  0:57             ` Kent Overstreet
@ 2024-03-04  0:46               ` Dave Chinner
  0 siblings, 0 replies; 87+ messages in thread
From: Dave Chinner @ 2024-03-04  0:46 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Amir Goldstein, Pankaj Raghav, Jens Axboe, Chris Mason,
	Matthew Wilcox, Daniel Gomez, linux-mm, Luis Chamberlain,
	Johannes Weiner, linux-fsdevel, lsf-pc, Linus Torvalds,
	Christoph Hellwig, Josef Bacik, Jan Kara

On Wed, Feb 28, 2024 at 07:57:38PM -0500, Kent Overstreet wrote:
> On Thu, Feb 29, 2024 at 11:25:33AM +1100, Dave Chinner wrote:
> > > That's doable - I can try to do that.
> > > What is your take regarding opt-in/opt-out of legacy behavior?
> > 
> > Screw the legacy code, don't even make it an option. No-one should
> > be relying on large buffered writes being atomic anymore, and with
> > high order folios in the page cache most small buffered writes are
> > going to be atomic w.r.t. both reads and writes anyway.
> 
> That's a new take...
> 
> > 
> > > At the time, I have proposed POSIX_FADV_TORN_RW API [1]
> > > to opt-out of the legacy POSIX behavior, but I guess that an xfs mount
> > > option would make more sense for consistent and clear semantics across
> > > the fs - it is easier if all buffered IO to inode behaved the same way.
> > 
> > No mount options, just change the behaviour. Applications already
> > have to avoid concurrent overlapping buffered reads and writes if
> > they care about data integrity and coherency, so making buffered
> > writes concurrent doesn't change anything.
> 
> Honestly - no.
> 
> Userspace would really like to see some sort of definition for this kind
> of behaviour, and if we just change things underneath them without
> telling anyone, _that's a dick move_.

I don't think you understand the full picture here, Kent.

> POSIX_FADV_TORN_RW is a terrible name, though.

The described behaviour for this advice is the standard behaviour
for ext4, btrfs and most linux filesystems other than XFS. It has
been for a -long- time.

The only filesystem that gives anything resembling POSIX atomic
write behaviour is XFS. No other filesystem in Linux actually
provides the POSIX "buffered reads won't see partial data from
buffered writes in progress" behaviour that XFS does via the IOLOCK
behaviour it uses.

So when I say "screw the legacy apps" I'm talking about the ancient
enterprise applications that still behave as if this POSIX behaviour
is reliable on modern linux systems. It simply isn't, and these apps
are *already implicitly broken* on most Linux filesystems and they
need fixing.

> And fadvise() is the wrong API for this because it applies to ranges,
> this should be an open flag or an fcntl.

Not only is it the wrong API, it's also the wrong approach to take.
We have a new API coming through for atomic writes: RWF_ATOMIC.

If an application needs an actual atomic IO guarantee, it will soon
be able to be explicit about that requirement, and it will not end up
in a situation where the filesystem it happens to use determines
whether implicit atomic write behaviour is provided.

Indeed, we don't actually say that XFS will always guarantee POSIX
atomic buffered IO semantics - we've just never decided that the
time is right to change the behaviour.

In making such a change to XFS, normal buffered writes will get
mostly the same behaviour as they do now because we now use high
order folios in the page cache and serialisation will be done
against high-order ranges rather than individual pages. And
applications that actually need atomic IO semantics can use
RWF_ATOMIC and in that case we can do explicitly serialised buffered
writes that lock out concurrent buffered reads as we do right now.

IOWs, there is no better time to convert XFS behaviour to match all
the other Linux filesystems than right now. Applications that need
atomic IO guarantees can use RWF_ATOMIC, and everyone else can get
the performance benefits that come from no longer trying to make
buffered IO implicitly "atomic".
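
From the application side, the explicit opt-in ends up being a per-IO
flag. Very roughly, and treating the flag plumbing as provisional
since the interface hasn't landed yet (the real thing also involves
discovering the supported atomic write unit sizes, e.g. via statx(),
and has alignment constraints this skips over):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC	0x00000040	/* placeholder if headers predate the flag */
#endif

int main(void)
{
	char buf[4096] __attribute__((aligned(4096)));
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd = open("testfile", O_CREAT | O_RDWR, 0644);
	ssize_t ret;

	if (fd < 0)
		return 1;
	memset(buf, 'a', sizeof(buf));

	/* Ask for this one block to be written untorn. */
	ret = pwritev2(fd, &iov, 1, 0, RWF_ATOMIC);
	if (ret < 0)
		perror("pwritev2(RWF_ATOMIC)");

	close(fd);
	return 0;
}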

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-29 20:51                                                           ` Kent Overstreet
@ 2024-03-05  2:19                                                             ` Paul E. McKenney
  0 siblings, 0 replies; 87+ messages in thread
From: Paul E. McKenney @ 2024-03-05  2:19 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Thu, Feb 29, 2024 at 03:51:08PM -0500, Kent Overstreet wrote:
> On Thu, Feb 29, 2024 at 11:42:53AM -0800, Paul E. McKenney wrote:
> > On Wed, Feb 28, 2024 at 06:55:12PM -0500, Kent Overstreet wrote:
> > > So what I'm saying is that in the absence of something noticing
> > > excessive memory being stranded and asking for an expedited grace
> > > period, the only bounds on the amount of memory being stranded will be
> > > how often RCU grace periods expire in the absence of anyone asking for
> > > them - that was my question to you. I'm not at all knowledgeable on RCU
> > > internals but I gather it varies with things like dynticks and whether
> > > or not userspace is calling into the kernel?
> > 
> > It does vary based on quite a few things.  Taking as an example your
> > dynticks point, a nohz_full CPU does not restart the tick upon entering
> > the kernel.  Which is what you want for short system calls because
> > enabling and disabling the tick can be expensive.  But this causes
> > RCU trouble if a nohz_full CPU executes a long-running system call,
> > so RCU will force the tick on should such a CPU refuse to respond to
> > RCU quickly enough.
> 
> Makes sense - so synchronize_rcu_expedited() has to check that and issue
> an IPI?

No and yes, in that synchronize_rcu_expedited() sends an IPI to every
CPU that is not either deep in the idle loop, in nohz_full userspace,
or offline.  A nohz_full CPU that is in a long system call will thus
get an IPI, which means that synchronize_rcu_expedited() need not rely
on the scheduling-clock interrupt.

In contrast, synchronize_rcu() (non-expedited) might well need to
re-enable the tick on this CPU in this case.

> > But the typical idle-system normal grace-period duration is a couple of
> > tens of milliseconds.  Except of course that if there are one-second-long
> > readers, you will see one-second-long grace periods.
> > 
> > > "gigabytes per second" - if userspace is doing big sequential streaming
> > > reads that don't fit into cache, we'll be evicting pagecache as quickly
> > > as we can read it in, so we should only be limited by your SSD
> > > bandwidth.
> > 
> > Understood.
> > 
> > But why not simply replace the current freeing with kfree_rcu() or
> > similar and see what really happens?  More further down...
> 
> Well, mainly because I've been doing filesystem and storage work for 15
> years and these kinds of problems are my bread and butter; in filesystem
> land (especially bcachefs, but really any modern journalling
> filesystem), there are a _lot_ of background reclaim tasks not unlike
> this.
> 
> So I'm speaking from experience here; the very first thing we're going
> to want when we do this for page cache pages and start analyzing system
> behaviour and performance is some basic numbers as to what's going on,
> and here it's practically no trouble at all to do it right.

I am not questioning the depth of your filesystem and storage experience.

However, the definition of "right" can vary widely.  In this case, "right"
might mean that we absolutely must do yet another special-purpose RCU.
I hope not, but if this is in fact the case, I need to learn that before
spending huge amounts of time attempting to tweak vanilla RCU into doing
something that it is incapable of doing.

> > > > And rcutorture really does do forward-progress testing on vanilla RCU
> > > > that features tight in-kernel loops doing call_rcu() without bounds on
> > > > memory in flight, and RCU routinely survives this in a VM that is given
> > > > only 512MB of memory.  In fact, any failure to survive this is considered
> > > > a bug to be fixed.
> > > 
> > > Are you saying there's already feedback between memory reclaim and RCU?
> > 
> > For the callback flooding in the torture test, yes.  There is a
> > register_oom_notifier() function, and if that function is called during
> > the callback flooding, that is a test failure.
> 
> Based on number of outstanding callbacks?

I am not sure what you are asking.  The OOM killer doesn't care about the
number of callbacks.  The callback flooding for vanilla RCU's torture
testing does not limit the number of callbacks.  Instead, it continues
allocating memory and passing it to call_rcu() until either the OOM
notification arrives (which is a bug in RCU) or until a reasonable number
of callbacks have completed their grace periods.

It does cheat in that the RCU callback function hands the memory directly
to rcutorture instead of taking a trip through the memory allocator.
It does this to focus the torturing on RCU instead of letting RCU hide
behind the memory allocator.
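
For reference, the notifier hook itself is just the generic OOM
notifier chain; a reclaim-to-RCU feedback path of the sort being
discussed could, very roughly, hang off the same interface.
register_oom_notifier() is real, but what the handler does here is
purely illustrative:

#include <linux/init.h>
#include <linux/notifier.h>
#include <linux/oom.h>
#include <linux/rcupdate.h>

static int rcu_reclaim_oom_notify(struct notifier_block *nb,
				  unsigned long unused, void *arg)
{
	/*
	 * Reclaim is about to OOM: force a grace period and wait for the
	 * pending callbacks so their memory actually comes back.
	 */
	synchronize_rcu_expedited();
	rcu_barrier();

	return NOTIFY_OK;
}

static struct notifier_block rcu_reclaim_oom_nb = {
	.notifier_call = rcu_reclaim_oom_notify,
};

static int __init rcu_reclaim_oom_init(void)
{
	return register_oom_notifier(&rcu_reclaim_oom_nb);
}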

> > There used to be a shrinker in the RCU implementation itself, but this
> > proved unhelpful and was therefore removed.  As in RCU worked fine without
> > it given the workloads of that time.  Who knows?  Maybe it needs to be
> > added back in.
> 
> Why was it unhelpful? Were you having problems with the approach, or
> just didn't seem to be useful?

If I recall correctly, the RCU shrinker was counterproductive because
the grace periods completed quickly enough on their own, so there was
no point in burning the additional CPU.

But times change, so maybe it is time to put a shrinker back in.
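
If it did go back in, the skeleton would be roughly the below, using
the classic register_shrinker() style for brevity (the shrinker API
has since been reworked around shrinker_alloc()); rcu_stranded_pages()
is a made-up stand-in for whatever accounting of callback-held memory
we end up with:

#include <linux/init.h>
#include <linux/rcupdate.h>
#include <linux/shrinker.h>

static unsigned long rcu_shrink_count(struct shrinker *s,
				      struct shrink_control *sc)
{
	return rcu_stranded_pages();	/* hypothetical accounting */
}

static unsigned long rcu_shrink_scan(struct shrinker *s,
				     struct shrink_control *sc)
{
	/*
	 * Reclaim is asking for memory: push a grace period along.  The
	 * memory itself comes back via the RCU callbacks, not from here,
	 * so tell the shrinker core not to keep calling us.
	 */
	synchronize_rcu_expedited();
	return SHRINK_STOP;
}

static struct shrinker rcu_stranded_shrinker = {
	.count_objects	= rcu_shrink_count,
	.scan_objects	= rcu_shrink_scan,
	.seeks		= DEFAULT_SEEKS,
};

static int __init rcu_stranded_shrinker_init(void)
{
	return register_shrinker(&rcu_stranded_shrinker, "rcu-stranded");
}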

> > The key point is that no one knows enough about this problem to dive at
> > all deeply into solution space.  From what you have described thus far,
> > the solution might involve one or more of the following options:
> > 
> > o	We do nothing, because RCU just handles it.  (Hey, I can dream,
> > 	can't I?)
> > 
> > o	We do a few modifications to RCU's force-quiescent state code.
> > 	This needs to be revisited for lazy preemption anyway, so this
> > 	is a good time to take a look.
> > 
> > o	We do a few modifications to RCU's force-quiescent state code,
> > 	but with the addition of communication of some sort from reclaim.
> > 	This appears to be your current focus.
> 
> Not exactly (but I'd like to hear what you have in mind); I'm
> envisioning this driven by reclaim itself.

I am wide open.  So much so that I consider your thought of driving RCU
from reclaim as a special case of communication of some sort from reclaim.

Again, I would like to see actual data.  This is not the simple part
of RCU.  In contrast, some years back when I immediately agreed to add
polling interfaces to SRCU, it was a simple addition to RCU that I had
long known would eventually be needed.

> > o	It might be that reclaim needs to use SRCU instead of RCU.
> > 	This would insulate reclaim from other subsystems' overly long
> > 	readers, but at the expense of smp_mb() overhead when entering
> > 	and exiting a reader.  Also, unlike RCU, SRCU has no reasonable
> > 	way of encouraging delayed readers (for example, delayed due
> > 	to preemption).
> 
> Yeah, I don't think SRCU is quite what we want here.

Given that you have actually used SRCU, I trust your judgment more than
I would trust most other people's.  But long and painful experience has
caused me to trust the judgment of real hardware running real software
quite a bit more than I trust that of any person, my own judgment
included.

> > o	It might be that reclaim needs its own specialized flavor of RCU,
> > 	as did tracing (Tasks RCU) and BPF (Tasks Trace RCU).  Plus there
> > 	is Steve Rostedt's additional flavor for tracing, which is now
> > 	Tasks RCU Rude.  I doubt that any of these three are what you
> > 	want, but let's get some hard data so that we can find out.
> > 
> > o	It might be that reclaim needs some synchronization mechanism
> > 	other than RCU.

You are free to ignore these last two for the time being, but they might
well prove necessary.  So I cannot ignore them.

> > We really need some hard data on how this thing affects RCU.  Again,
> > could we do the kfree()->kfree_rcu() change (and similar change,
> > depending on what memory allocator is used) and then run some tests to
> > see how RCU reacts?
> 
> Well, like I mentioned previously - the "hard data" you say we need is
> exactly what I was saying we need to collect as a first order of
> business - we need to know the amount of memory stranded by RCU, and we
> need to be able to watch that number as we run through different
> buffered IO workloads.

Good!  Please do it!  Change those kfree() calls to kfree_rcu() and
let's see whether or not the system OOMs.
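
For slab-allocated objects the change really is mechanical; the sketch
below is the whole of it (pagecache folios would need the analogous
change in the folio freeing path, which is the "similar change" above):

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
	struct rcu_head rcu;	/* gives kfree_rcu() somewhere to queue */
	/* ... payload ... */
};

static void foo_free(struct foo *f)
{
	/*
	 * Before: kfree(f);
	 * After: defer the free until a grace period has elapsed, so RCU
	 * readers still traversing the object never see it recycled.
	 */
	kfree_rcu(f, rcu);
}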

> Sure, we could do this as a special case for just pagecache page
> freeing; but it looks like zero extra work to just do it right and do it
> for kfree_rcu() in general; and I've already talked about how minimal
> the performance cost is.

Kent, kfree_rcu() is itself a special case.

There is call_rcu(), synchronize_rcu(), synchronize_rcu_expedited(), as
well as all of the polling interfaces, none of which have any reasonable
way to know the corresponding size of data.  And a lot of the RCU traffic
typically goes through call_rcu() and synchronize_rcu().

> And more broadly, we have real gaps when it comes to OOM debugging and
> being able to report on who's responsible for what memory and how much,
> and that's something I want to close. I've got a shrinker debugging
> patch that I really need to get back to and shove in, and really in that
> area I need to go further and add byte counters to shrinker reporting.
> 
> IOW, this isn't just an RCU topic, this is a memory reclaim topic, and
> lots of us filesystem developers have... things to say on the subject of
> memory reclaim. Being able to get proper information so that we can
> debug stuff has been a major point of contention; I hope you can
> try to appreciate where we're coming from.

I am trying, Kent.  I really am.  I am absolutely not succeeding.

Without the kfree()->kfree_rcu() testing, how do you even know that
vanilla RCU is in fact the right solution?  The answer is that you
absolutely do not know one way or the other.

Please keep in mind that we have had several cases where vanilla RCU
was not the right answer, which is why SRCU, Tasks RCU, Tasks Trace RCU,
and Tasks Rude RCU exist.  Plus there has been code that uses workqueues
as a type of RCU.  And if you are going to do something like free half
of memory in one shot and be unable to re-use that memory until after a
normal RCU grace period elapses, we probably need something quite special.
Maybe things never get that extreme, but you have been emphasizing tens
of gigabytes, and that would likely require something special on even a
64GB system.  Or maybe you would only do tens of gigabytes on a system
with terabytes of memory, in which case I am not quite so worried.

Which brings us to how do you even know that RCU needs to change at all?
Again, the answer is that you absolutely do not know one way or the other.

> > If RCU does not react all that well, then we will also need information on
> > what the RCU readers look like, for example, must they block, must they
> > be preemptible, and about how long do they run for?  Best if they never
> > block, need not be preemptible, and run for very short periods of time,
> > but they are what they are.  Perhaps Matthew Wilcox has some thoughts
> > on this.
> 
> We'll be using this in filemap_read() (and I've worked on that code more
> than most); we're already using rcu_read_lock() there for the xarray
> lookup, which is partly why RCU (the regular sort) is a natural fit for
> protecting the data itself.
> 
> This will move the copy_to_user_nofault() under rcu_read_lock(), so
> we're not blocking or faulting or preempting or anything like that. Just
> doing a (potentially big) copy to userspace.

Thank you for the heads up.  Are these also things that you believe
that RCU cannot handle without modifications?  If so, it would be good
to test these as well.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-24 22:57         ` Chris Mason
  2024-02-24 23:40           ` Linus Torvalds
@ 2024-05-10 23:57           ` Luis Chamberlain
  1 sibling, 0 replies; 87+ messages in thread
From: Luis Chamberlain @ 2024-05-10 23:57 UTC (permalink / raw)
  To: Chris Mason, Dave Chinner, David Bueso, Kent Overstreet,
	Paul E. McKenney
  Cc: Linus Torvalds, Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm,
	Daniel Gomez, Pankaj Raghav, Jens Axboe, Christoph Hellwig,
	Chris Mason, Johannes Weiner

On Sat, Feb 24, 2024 at 05:57:43PM -0500, Chris Mason wrote:
> Going back to Luis's original email, I'd echo Willy's suggestion for
> profiles.  Unless we're saturating memory bandwidth, buffered should be
> able to get much closer to O_DIRECT, just at a much higher overall cost.

I finally had some time to look beyond just "what locks" could be the
main culprit; David Bueso helped me review this, thanks!

Based on all the discussions on this insanely long thread, I do believe
the issue was the single-threaded writeback cache flushing that Chinner
noted.

Lifting /proc/sys/vm/dirty_ratio from 20 to 90 keeps the profile clean,
with most of the top penalties in userspace, as shown in exhibit a)
below. But as soon as we start throttling we hit the profile shown in
exhibit b) below.

a) without the throttling:

Samples: 1M of event 'cycles:P', Event count (approx.): 1061541571785                                                                                                                         
  Children      Self  Command          Shared Object               Symbol                                                                                                                     
+   17.05%    16.85%  fio              fio                         [.] get_io_u                                                                                                              ◆
+    3.04%     0.01%  fio              [kernel.vmlinux]            [k] entry_SYSCALL_64                                                                                                      ▒
+    3.03%     0.02%  fio              [kernel.vmlinux]            [k] do_syscall_64                                                                                                         ▒
+    1.39%     0.04%  fio              [kernel.vmlinux]            [k] __do_sys_io_uring_enter                                                                                               ▒
+    1.33%     0.00%  fio              libc.so.6                   [.] __GI___libc_open                                                                                                      ▒
+    1.33%     0.00%  fio              [kernel.vmlinux]            [k] __x64_sys_openat                                                                                                      ▒
+    1.33%     0.00%  fio              [kernel.vmlinux]            [k] do_sys_openat2                                                                                                        ▒
+    1.33%     0.00%  fio              [unknown]                   [k] 0x312d6e65742f2f6d                                                                                                    ▒
+    1.33%     0.00%  fio              [kernel.vmlinux]            [k] do_filp_open                                                                                                          ▒
+    1.33%     0.00%  fio              [kernel.vmlinux]            [k] path_openat                                                                                                           ▒
+    1.29%     0.00%  fio              [kernel.vmlinux]            [k] down_write                                                                                                            ▒
+    1.29%     0.00%  fio              [kernel.vmlinux]            [k] rwsem_down_write_slowpath                                                                                             ▒
+    1.26%     1.25%  fio              [kernel.vmlinux]            [k] osq_lock                                                                                                              ▒
+    1.14%     0.00%  fio              fio                         [.] 0x000055bbb94449fa                                                                                                    ▒
+    1.14%     1.14%  fio              fio                         [.] 0x000000000002a9f5                                                                                                    ▒
+    0.98%     0.00%  fio              [unknown]                   [k] 0x000055bbd6310520                                                                                                    ▒
+    0.93%     0.00%  fio              fio                         [.] 0x000055bbb94b197b                                                                                                    ▒
+    0.89%     0.00%  perf             libc.so.6                   [.] __GI___libc_write                                                                                                     ▒
+    0.89%     0.00%  perf             [kernel.vmlinux]            [k] entry_SYSCALL_64                                                                                                      ▒
+    0.88%     0.00%  perf             [kernel.vmlinux]            [k] do_syscall_64                                                                                                         ▒
+    0.86%     0.00%  perf             [kernel.vmlinux]            [k] ksys_write                                                                                                            ▒
+    0.85%     0.01%  perf             [kernel.vmlinux]            [k] vfs_write                                                                                                             ▒
+    0.83%     0.00%  perf             [ext4]                      [k] ext4_buffered_write_iter                                                                                              ▒
+    0.81%     0.01%  perf             [kernel.vmlinux]            [k] generic_perform_write                                                                                                 ▒
+    0.77%     0.02%  fio              [kernel.vmlinux]            [k] io_submit_sqes                                                                                                        ▒
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] ret_from_fork_asm                                                                                                     ▒
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] ret_from_fork                                                                                                         ▒
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] kthread                                                                                                               ▒
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] worker_thread                                                                                                         ▒
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] process_one_work                                                                                                      ▒
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] wb_workfn                                                                                                             ▒
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] wb_writeback                                                                                                          ▒
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] __writeback_inodes_wb                                                                                                 ▒
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] writeback_sb_inodes                                                                                                   ▒
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] __writeback_single_inode                                                                                              ▒
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] do_writepages                                                                                                         ▒
+    0.76%     0.00%  kworker/u513:26  [xfs]                       [k] xfs_vm_writepages                                                                                                     ▒
+    0.75%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] submit_bio_noacct_nocheck                                                                                             ▒
+    0.75%     0.00%  kworker/u513:26  [kernel.vmlinux]            [k] iomap_submit_ioend          

So we see *more* penalty from perf's own buffered IO writes of the
perf data than from any writeback by XFS.

b) when we hit throttling:

Samples: 1M of event 'cycles:P', Event count (approx.): 816903693659                                                                                                                          
  Children      Self  Command          Shared Object               Symbol                                                                                                                     
+   14.24%    14.06%  fio              fio                         [.] get_io_u                                                                                                              ◆
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] ret_from_fork_asm                                                                                                     ▒
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] ret_from_fork                                                                                                         ▒
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] kthread                                                                                                               ▒
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] worker_thread                                                                                                         ▒
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] process_one_work                                                                                                      ▒
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] wb_workfn                                                                                                             ▒
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] wb_writeback                                                                                                          ▒
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] __writeback_inodes_wb                                                                                                 ▒
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] writeback_sb_inodes                                                                                                   ▒
+    4.87%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] __writeback_single_inode                                                                                              ▒
+    4.87%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] do_writepages                                                                                                         ▒
+    4.87%     0.00%  kworker/u513:3-  [xfs]                       [k] xfs_vm_writepages                                                                                                     ▒
+    4.82%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] iomap_submit_ioend                                                                                                    ▒
+    4.82%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] submit_bio_noacct_nocheck                                                                                             ▒
+    4.82%     0.00%  kworker/u513:3-  [kernel.vmlinux]            [k] __submit_bio                                                                                                          ▒
+    4.82%     0.04%  kworker/u513:3-  [nd_pmem]                   [k] pmem_submit_bio                                                                                                       ▒
+    4.78%     0.05%  kworker/u513:3-  [nd_pmem]                   [k] pmem_do_write          

Although my focus was on measuring the limits of the page cache, this
thread also produced a *slew* of ideas on how to improve the status quo,
pathological workloads or not. We have to accept that some workloads are
clearly pathological, but that's the point of coming up with limits and
testing the page cache. And since there were so many unexpected ideas
spread across this entire thread about general improvements, even for
general use cases, I've collected all of them as notes for review for
this topic at LSFMM.

Thanks all for the feedback!

  Luis

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26 22:46                       ` Linus Torvalds
  2024-02-26 23:48                         ` Linus Torvalds
@ 2024-05-14 11:52                         ` Luis Chamberlain
  2024-05-14 16:04                           ` Linus Torvalds
  1 sibling, 1 reply; 87+ messages in thread
From: Luis Chamberlain @ 2024-05-14 11:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Kent Overstreet, Matthew Wilcox, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Mon, Feb 26, 2024 at 02:46:56PM -0800, Linus Torvalds wrote:
> I really haven't tested this AT ALL. I'm much too scared. But I don't
> actually hate how the code looks nearly as much as I *thought* I'd
> hate it.

Thanks for this; obviously those interested in this will have to test
it and fix the issues below. I've tested for regressions just against
XFS with the 4k reflink profile and detected only two failures;
generic/095 fails at a rate of about 1/2:

  * generic/095
  * generic/741

For details refer to the test result archive [0]. To reproduce with
kdevops:

make defconfig-xfs_reflink_4k KDEVOPS_HOSTS_PREFIX=hacks B4_MESSAGE_ID=20240514084221.3664475-1-mcgrof@kernel.org
make -j $(nproc)
make bringup
make fstests
make linux
make fstests-baseline TESTS=generic/095 COUNT=10

[0] https://github.com/linux-kdevops/kdevops-results-archive/commit/27186b2782af4af559524539a2f6058d12361e10

  Luis

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-05-14 11:52                         ` Luis Chamberlain
@ 2024-05-14 16:04                           ` Linus Torvalds
  0 siblings, 0 replies; 87+ messages in thread
From: Linus Torvalds @ 2024-05-14 16:04 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Al Viro, Kent Overstreet, Matthew Wilcox, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Tue, 14 May 2024 at 04:52, Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Mon, Feb 26, 2024 at 02:46:56PM -0800, Linus Torvalds wrote:
> > I really haven't tested this AT ALL. I'm much too scared. But I don't
> > actually hate how the code looks nearly as much as I *thought* I'd
> > hate it.
>
> Thanks for this, obviously those interested in this will have to test
> this and fix the below issues. I've tested for regressions just against
> xfs on 4k reflink profile and detected only two failures, generic/095
> fails with a failure rate of about 1/2 or so:
>
>   * generic/095
>   * generic/741

Funky.

I do *not* see how those can fail due to the change, but that's the
point of testing.

Somebody who knows those two tests better, and figures out what the
difference is would have to get involved.

One different thing that my fast-read case does is that it does *NOT*
do the crazy dcache coherency thing that the "full" case does, ie the

                writably_mapped = mapping_writably_mapped(mapping);
                ...
                        /*
                         * If users can be writing to this folio using arbitrary
                         * virtual addresses, take care of potential aliasing
                         * before reading the folio on the kernel side.
                         */
                        if (writably_mapped)
                                flush_dcache_folio(folio);

but that shouldn't matter on any sane architecture. Sadly, even arm64
counts as "insane" here, because it does the I$ sync using
flush_dcache_folio().

I can't tell what architecture the testing was done on, but I assume
it was x86, and I assume the above detail is _not_ the cause.

               Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

end of thread, other threads:[~2024-05-14 16:04 UTC | newest]

Thread overview: 87+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-23 23:59 [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO Luis Chamberlain
2024-02-24  4:12 ` Matthew Wilcox
2024-02-24 17:31   ` Linus Torvalds
2024-02-24 18:13     ` Matthew Wilcox
2024-02-24 18:24       ` Linus Torvalds
2024-02-24 18:20     ` Linus Torvalds
2024-02-24 19:11       ` Linus Torvalds
2024-02-24 21:42         ` Theodore Ts'o
2024-02-24 22:57         ` Chris Mason
2024-02-24 23:40           ` Linus Torvalds
2024-05-10 23:57           ` Luis Chamberlain
2024-02-25  5:18     ` Kent Overstreet
2024-02-25  6:04       ` Kent Overstreet
2024-02-25 13:10       ` Matthew Wilcox
2024-02-25 17:03         ` Linus Torvalds
2024-02-25 21:14           ` Matthew Wilcox
2024-02-25 23:45             ` Linus Torvalds
2024-02-26  1:02               ` Kent Overstreet
2024-02-26  1:32                 ` Linus Torvalds
2024-02-26  1:58                   ` Kent Overstreet
2024-02-26  2:06                     ` Kent Overstreet
2024-02-26  2:34                     ` Linus Torvalds
2024-02-26  2:50                   ` Al Viro
2024-02-26 17:17                     ` Linus Torvalds
2024-02-26 21:07                       ` Matthew Wilcox
2024-02-26 21:17                         ` Kent Overstreet
2024-02-26 21:19                           ` Kent Overstreet
2024-02-26 21:55                             ` Paul E. McKenney
2024-02-26 23:29                               ` Kent Overstreet
2024-02-27  0:05                                 ` Paul E. McKenney
2024-02-27  0:29                                   ` Kent Overstreet
2024-02-27  0:55                                     ` Paul E. McKenney
2024-02-27  1:08                                       ` Kent Overstreet
2024-02-27  5:17                                         ` Paul E. McKenney
2024-02-27  6:21                                           ` Kent Overstreet
2024-02-27 15:32                                             ` Paul E. McKenney
2024-02-27 15:52                                               ` Kent Overstreet
2024-02-27 16:06                                                 ` Paul E. McKenney
2024-02-27 15:54                                               ` Matthew Wilcox
2024-02-27 16:21                                                 ` Paul E. McKenney
2024-02-27 16:34                                                   ` Kent Overstreet
2024-02-27 17:58                                                     ` Paul E. McKenney
2024-02-28 23:55                                                       ` Kent Overstreet
2024-02-29 19:42                                                         ` Paul E. McKenney
2024-02-29 20:51                                                           ` Kent Overstreet
2024-03-05  2:19                                                             ` Paul E. McKenney
2024-02-27  0:43                                 ` Dave Chinner
2024-02-26 22:46                       ` Linus Torvalds
2024-02-26 23:48                         ` Linus Torvalds
2024-02-27  7:21                           ` Kent Overstreet
2024-02-27 15:39                             ` Matthew Wilcox
2024-02-27 15:54                               ` Kent Overstreet
2024-02-27 16:34                             ` Linus Torvalds
2024-02-27 16:47                               ` Kent Overstreet
2024-02-27 17:07                                 ` Linus Torvalds
2024-02-27 17:20                                   ` Kent Overstreet
2024-02-27 18:02                                     ` Linus Torvalds
2024-05-14 11:52                         ` Luis Chamberlain
2024-05-14 16:04                           ` Linus Torvalds
2024-02-25 21:29           ` Kent Overstreet
2024-02-25 17:32         ` Kent Overstreet
2024-02-24 17:55   ` Luis Chamberlain
2024-02-25  5:24 ` Kent Overstreet
2024-02-26 12:22 ` Dave Chinner
2024-02-27 10:07 ` Kent Overstreet
2024-02-27 14:08   ` Luis Chamberlain
2024-02-27 14:57     ` Kent Overstreet
2024-02-27 22:13   ` Dave Chinner
2024-02-27 22:21     ` Kent Overstreet
2024-02-27 22:42       ` Dave Chinner
2024-02-28  7:48         ` [Lsf-pc] " Amir Goldstein
2024-02-28 14:01           ` Chris Mason
2024-02-29  0:25           ` Dave Chinner
2024-02-29  0:57             ` Kent Overstreet
2024-03-04  0:46               ` Dave Chinner
2024-02-27 22:46       ` Linus Torvalds
2024-02-27 23:00         ` Linus Torvalds
2024-02-28  2:22         ` Kent Overstreet
2024-02-28  3:00           ` Matthew Wilcox
2024-02-28  4:22             ` Matthew Wilcox
2024-02-28 17:34               ` Kent Overstreet
2024-02-28 18:04                 ` Matthew Wilcox
2024-02-28 18:18         ` Kent Overstreet
2024-02-28 19:09           ` Linus Torvalds
2024-02-28 19:29             ` Kent Overstreet
2024-02-28 20:17               ` Linus Torvalds
2024-02-28 23:21                 ` Kent Overstreet
